On data skewness, stragglers, and MapReduce progress indicators

Emilio Coppa, Irene Finocchi
{"title":"On data skewness, stragglers, and MapReduce progress indicators","authors":"Emilio Coppa, Irene Finocchi","doi":"10.1145/2806777.2806843","DOIUrl":null,"url":null,"abstract":"We tackle the problem of predicting the performance of MapReduce applications designing accurate progress indicators, which keep programmers informed on the percentage of completed computation time during the execution of a job. This is especially important in pay-as-you-go cloud environments, where slow jobs can be aborted in order to avoid excessive costs. Performance predictions can also serve as a building block for several profile-guided optimizations. By assuming that the running time depends linearly on the input size, state-of-the-art techniques can be seriously harmed by data skewness, load unbalancing, and straggling tasks. We thus design a novel profile-guided progress indicator, called NearestFit, that operates without the linear hypothesis assumption in a fully online way (i.e., without resorting to profile data collected from previous executions). NearestFit exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. Fine-grained profiles required by our theoretical progress model are approximated through space- and time-efficient data streaming algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive empirical assessment over the Amazon EC2 platform on a variety of benchmarks shows that its accuracy is very good, even when competitors incur non-negligible errors and wide prediction fluctuations.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"33","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Sixth ACM Symposium on Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2806777.2806843","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 33

Abstract

We tackle the problem of predicting the performance of MapReduce applications designing accurate progress indicators, which keep programmers informed on the percentage of completed computation time during the execution of a job. This is especially important in pay-as-you-go cloud environments, where slow jobs can be aborted in order to avoid excessive costs. Performance predictions can also serve as a building block for several profile-guided optimizations. By assuming that the running time depends linearly on the input size, state-of-the-art techniques can be seriously harmed by data skewness, load unbalancing, and straggling tasks. We thus design a novel profile-guided progress indicator, called NearestFit, that operates without the linear hypothesis assumption in a fully online way (i.e., without resorting to profile data collected from previous executions). NearestFit exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. Fine-grained profiles required by our theoretical progress model are approximated through space- and time-efficient data streaming algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive empirical assessment over the Amazon EC2 platform on a variety of benchmarks shows that its accuracy is very good, even when competitors incur non-negligible errors and wide prediction fluctuations.
关于数据偏度,掉队,MapReduce进度指标
我们解决了预测MapReduce应用程序性能的问题,设计了精确的进度指标,让程序员了解在执行任务期间完成的计算时间的百分比。这在现收现付的云环境中尤其重要,在这种环境中,可以终止慢速作业,以避免过高的成本。性能预测还可以作为若干配置文件引导优化的构建块。假设运行时间线性地依赖于输入大小,那么最先进的技术可能会受到数据偏态、负载不平衡和分散任务的严重损害。因此,我们设计了一种新的概要文件引导的进度指示器,称为NearestFit,它以完全在线的方式运行,而不需要线性假设(即,不依赖于从以前执行中收集的概要文件数据)。NearestFit利用了最近邻回归和统计曲线拟合技术的仔细组合。我们的理论进展模型所需要的细粒度剖面是通过空间和时间效率高的数据流算法来近似的。我们在Hadoop 2.6.0之上实现了NearestFit。对Amazon EC2平台在各种基准测试上的广泛经验评估表明,即使竞争对手产生不可忽略的错误和广泛的预测波动,其准确性也非常好。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信