Performance Modeling and Optimization of Deadline-Driven Pig Programs

IF 2.2 · Zone 4 (Computer Science) · JCR Q3 (Computer Science, Artificial Intelligence)
Zhuoyao Zhang, L. Cherkasova, Abhishek Verma, B. T. Loo
Journal: ACM Transactions on Autonomous and Adaptive Systems, pages 14:1-14:28
Publication date: 2013-09-01
Publication type: Journal Article
DOI: 10.1145/2518017.2518019 (https://doi.org/10.1145/2518017.2518019)
Citations: 17

Abstract

Many applications associated with live business intelligence are written as complex data analysis programs defined by directed acyclic graphs (DAGs) of MapReduce jobs, for example, using the Pig, Hive, or Scope frameworks. An increasing number of these applications have additional requirements for completion time guarantees. In this article, we consider the popular Pig framework, which provides a high-level SQL-like abstraction on top of the MapReduce engine for processing large data sets. There is a lack of performance models and analysis tools for automated performance management of such MapReduce jobs. We offer a performance modeling environment for Pig programs that automatically profiles jobs from past runs and aims to solve the following interrelated problems: (i) estimating the completion time of a Pig program as a function of allocated resources; (ii) estimating the amount of resources (the number of map and reduce slots) required for completing a Pig program with a given (soft) deadline. First, we design a basic performance model that accurately predicts completion time and required resource allocation for a Pig program that is defined as a sequence of MapReduce jobs: predicted completion times are within 10% of the measured ones. Second, we optimize a Pig program's execution by enforcing the optimal schedule of its concurrent jobs. For DAGs with concurrent jobs, this optimization helps reduce the program completion time by 10% to 27% in our experiments. Moreover, it eliminates possible nondeterminism in the execution of concurrent jobs in the Pig program and therefore enables a more accurate performance model for Pig programs. Third, based on these optimizations, we propose a refined performance model for Pig programs with concurrent jobs. The proposed approach leads to significant resource savings (20% to 60% in our experiments) compared with the original, unoptimized solution. We validate our solution using a 66-node Hadoop cluster and a diverse set of workloads: the PigMix benchmark, TPC-H queries, and customized queries mining a collection of HP Labs’ web proxy logs.
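
The two estimation problems in the abstract, (i) completion time as a function of allocated slots and (ii) slots needed to meet a deadline, can be illustrated with a minimal sketch. It is not the article's model, which the abstract does not spell out. It assumes a simple makespan-bounds estimate for greedily scheduled tasks (lower bound n·avg/k, upper bound (n−1)·avg/k + max for n tasks on k slots), takes the midpoint of the two bounds, treats the program as a strict sequence of jobs, and allocates the same number of map and reduce slots. All class names, function names, and profile numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StageProfile:
    """Hypothetical profile of one stage (map or reduce) from a past run."""
    num_tasks: int
    avg_task_sec: float
    max_task_sec: float


@dataclass
class JobProfile:
    """Hypothetical profile of one MapReduce job in the Pig program."""
    map_stage: StageProfile
    reduce_stage: StageProfile


def stage_bounds(stage: StageProfile, slots: int) -> Tuple[float, float]:
    """Lower and upper makespan bounds for greedy assignment of tasks to slots."""
    lower = stage.num_tasks * stage.avg_task_sec / slots
    upper = (stage.num_tasks - 1) * stage.avg_task_sec / slots + stage.max_task_sec
    return lower, upper


def estimate_completion(jobs: List[JobProfile], map_slots: int, reduce_slots: int) -> float:
    """Estimate for a sequence of jobs: sum of per-stage bound midpoints (an assumption)."""
    total = 0.0
    for job in jobs:
        for stage, slots in ((job.map_stage, map_slots), (job.reduce_stage, reduce_slots)):
            lower, upper = stage_bounds(stage, slots)
            total += (lower + upper) / 2.0
    return total


def min_slots_for_deadline(jobs: List[JobProfile], deadline_sec: float,
                           max_slots: int) -> Optional[int]:
    """Smallest equal map/reduce slot count whose estimate meets the deadline, else None."""
    for slots in range(1, max_slots + 1):
        # The estimate only decreases as slots grow, so the first hit is the minimum.
        if estimate_completion(jobs, slots, slots) <= deadline_sec:
            return slots
    return None


if __name__ == "__main__":
    # Made-up profile: 100 map tasks (avg 10 s, max 20 s), 20 reduce tasks (avg 30 s, max 50 s).
    program = [JobProfile(StageProfile(100, 10.0, 20.0), StageProfile(20, 30.0, 50.0))]
    print(estimate_completion(program, map_slots=10, reduce_slots=5))
    print(min_slots_for_deadline(program, deadline_sec=300.0, max_slots=64))
```

A linear scan over slot counts is enough here because the estimate is monotone in the number of slots. A model that allocates map and reduce slots separately would search over pairs instead.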
Source journal
ACM Transactions on Autonomous and Adaptive Systems (Engineering and Technology: Computer Science, Theory and Methods)
CiteScore: 4.80
Self-citation rate: 7.40%
Articles published: 9
Review time: >12 weeks
About the journal: TAAS addresses research on autonomous and adaptive systems being undertaken by an increasingly interdisciplinary research community, and provides a common platform under which this work can be published and disseminated. TAAS encourages contributions aimed at supporting the understanding, development, and control of such systems and of their behaviors. Contributions are expected to be based on sound and innovative theoretical models, algorithms, engineering and programming techniques, infrastructures and systems, or technological and application experiences.