A self-tuning system based on application Profiling and Performance Analysis for optimizing Hadoop MapReduce cluster configuration

20th Annual International Conference on High Performance Computing Pub Date : 2013-12-01 DOI:10.1109/HiPC.2013.6799133

Dili Wu, A. Gokhale


Citations: 56

Abstract

Apache Hadoop is one of the most widely used frameworks for programming MapReduce-based applications. Despite its popularity, application developers face numerous challenges in using Hadoop: they must manage the resources of a MapReduce cluster effectively and configure the framework so that it optimizes the performance and reliability of the MapReduce applications running on it. This paper addresses these problems by presenting the Profiling and Performance Analysis-based System (PPABS) framework, which automates the tuning of Hadoop configuration settings based on deduced application performance requirements. The PPABS framework comprises two distinct phases. The Analyzer trains PPABS to form a set of equivalence classes of MapReduce applications and determines, for each class, the Hadoop configuration parameters that maximally improve performance for that class. The Recognizer classifies an incoming unknown job into one of these equivalence classes so that its Hadoop configuration parameters can be self-tuned. The key research contributions in the Analyzer phase include modifications to the well-known k-means++ clustering and Simulated Annealing algorithms, which were required to adapt them to the MapReduce paradigm. The key contributions in the Recognizer phase include an approach to classify an unknown, incoming job into one of the equivalence classes and a control strategy to self-tune the Hadoop cluster configuration parameters for that job. Experiments comparing the performance improvements for three different classes of applications running on Hadoop clusters deployed on Amazon EC2 show promising results.
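The abstract does not describe the authors' modified algorithms, so the following is only a minimal Python sketch of the two-phase idea built from the textbook versions of k-means++ and simulated annealing. All function names, the two-feature job profiles, and the cost function are hypothetical and chosen for illustration; they are not taken from PPABS.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(profile, centroids):
    """Recognizer step (assumed): index of the closest class centroid."""
    return min(range(len(centroids)), key=lambda i: sq_dist(profile, centroids[i]))


def kmeanspp_seeds(points, k, rng):
    """Textbook k-means++ seeding: each new seed is drawn with probability
    proportional to its squared distance from the nearest existing seed."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(sq_dist(p, s) for s in seeds) for p in points]
        if sum(weights) == 0:  # all points coincide with existing seeds
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


def kmeans(points, k, rng, iters=20):
    """Analyzer step (assumed): group job profiles into k equivalence classes."""
    centroids = kmeanspp_seeds(points, k, rng)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members; keep the old
        # centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(c) / len(m) for c in zip(*m)) if m else centroids[i]
            for i, m in enumerate(clusters)
        ]
    return centroids


def anneal(cost, start, neighbor, rng, steps=500, t0=1.0, cooling=0.99):
    """Textbook simulated annealing over a configuration value."""
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_cost = cost(cand)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature t decreases.
        if cand_cost <= current_cost or rng.random() < math.exp((current_cost - cand_cost) / t):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost


# Hypothetical two-feature job profiles forming two well-separated groups.
profiles = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
rng = random.Random(0)
centroids = kmeans(profiles, k=2, rng=rng)


# Hypothetical stand-in for measured job runtime as a function of the
# number of reducers; a real system would run or model the job instead.
def toy_cost(reducers):
    return (reducers - 8) ** 2 + 5


def step(reducers, rng):
    return max(1, reducers + rng.choice([-1, 1]))


best_reducers, best_cost = anneal(toy_cost, start=1, neighbor=step, rng=rng)

# Recognizer: map a previously unseen job profile to its class.
new_job_class = nearest((1.5, 1.5), centroids)
```

In this sketch each class would get its own annealing run, and a recognized job would simply inherit the configuration found for its class. How PPABS actually profiles jobs, defines cost, and applies the tuned settings is described in the paper, not here.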

