A self-tuning system based on application Profiling and Performance Analysis for optimizing Hadoop MapReduce cluster configuration

Dili Wu, A. Gokhale
{"title":"A self-tuning system based on application Profiling and Performance Analysis for optimizing Hadoop MapReduce cluster configuration","authors":"Dili Wu, A. Gokhale","doi":"10.1109/HiPC.2013.6799133","DOIUrl":null,"url":null,"abstract":"One of the most widely used frameworks for programming MapReduce-based applications is Apache Hadoop. Despite its popularity, however, application developers face numerous challenges in using the Hadoop framework, which stem from them having to effectively manage the resources of a MapReduce cluster, and configuring the framework in a way that will optimize the performance and reliability of MapReduce applications running on it. This paper addresses these problems by presenting the Profiling and Performance Analysis-based System (PPABS) framework, which automates the tuning of Hadoop configuration settings based on deduced application performance requirements. The PPABS framework comprises two distinct phases called the Analyzer, which trains PPABS to form a set of equivalence classes of MapReduce applications for which the most appropriate Hadoop config- uration parameters that maximally improve performance for that class are determined, and the Recognizer, which classifies an incoming unknown job to one of these equivalence classes so that its Hadoop configuration parameters can be self-tuned. The key research contributions in the Analyzer phase includes modifications to the well-known k - means + + clustering and Simulated Annealing algorithms, which were required to adapt them to the MapReduce paradigm. The key contributions in the Recognizer phase includes an approach to classify an unknown, incoming job to one of the equivalence classes and a control strategy to self-tune the Hadoop cluster configuration parameters for that job. Experimental results comparing the performance improvements for three different classes of applications running on Hadoop clusters deployed on Amazon EC2 show promising results.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"56","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"20th Annual International Conference on High Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2013.6799133","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 56

Abstract

One of the most widely used frameworks for programming MapReduce-based applications is Apache Hadoop. Despite its popularity, however, application developers face numerous challenges in using the Hadoop framework, which stem from them having to effectively manage the resources of a MapReduce cluster, and configuring the framework in a way that will optimize the performance and reliability of MapReduce applications running on it. This paper addresses these problems by presenting the Profiling and Performance Analysis-based System (PPABS) framework, which automates the tuning of Hadoop configuration settings based on deduced application performance requirements. The PPABS framework comprises two distinct phases called the Analyzer, which trains PPABS to form a set of equivalence classes of MapReduce applications for which the most appropriate Hadoop config- uration parameters that maximally improve performance for that class are determined, and the Recognizer, which classifies an incoming unknown job to one of these equivalence classes so that its Hadoop configuration parameters can be self-tuned. The key research contributions in the Analyzer phase includes modifications to the well-known k - means + + clustering and Simulated Annealing algorithms, which were required to adapt them to the MapReduce paradigm. The key contributions in the Recognizer phase includes an approach to classify an unknown, incoming job to one of the equivalence classes and a control strategy to self-tune the Hadoop cluster configuration parameters for that job. Experimental results comparing the performance improvements for three different classes of applications running on Hadoop clusters deployed on Amazon EC2 show promising results.
基于应用分析和性能分析的自调优系统,用于优化Hadoop MapReduce集群配置
用于编程基于mapreduce的应用程序的最广泛使用的框架之一是Apache Hadoop。尽管它很受欢迎,但是应用程序开发人员在使用Hadoop框架时面临着许多挑战,这源于他们必须有效地管理MapReduce集群的资源,并以优化运行在其上的MapReduce应用程序的性能和可靠性的方式配置框架。本文通过介绍基于性能分析和性能分析的系统(PPABS)框架来解决这些问题,该框架可以根据推断的应用程序性能需求自动调整Hadoop配置设置。PPABS框架包括两个不同的阶段,称为Analyzer,它训练PPABS形成MapReduce应用程序的一组等价类,为这些等价类确定最合适的Hadoop配置参数,从而最大限度地提高该类的性能;以及Recognizer,它将传入的未知作业分类到这些等价类之一,以便其Hadoop配置参数可以自调。Analyzer阶段的主要研究贡献包括对众所周知的k - means++聚类和模拟退火算法的修改,这些算法需要使它们适应MapReduce范式。识别器阶段的关键贡献包括一种将未知的传入作业分类到一个等价类的方法,以及一种为该作业自调优Hadoop集群配置参数的控制策略。在Amazon EC2上部署的Hadoop集群上运行的三种不同类型的应用程序的性能改进的比较实验结果显示了令人鼓舞的结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信