On data skewness, stragglers, and MapReduce progress indicators

Proceedings of the Sixth ACM Symposium on Cloud Computing. Pub Date: 2015-03-31. DOI: 10.1145/2806777.2806843

Emilio Coppa, Irene Finocchi


Citations: 33

Abstract

We tackle the problem of predicting the performance of MapReduce applications by designing accurate progress indicators, which keep programmers informed of the percentage of computation time completed during the execution of a job. This is especially important in pay-as-you-go cloud environments, where slow jobs can be aborted to avoid excessive costs. Performance predictions can also serve as a building block for several profile-guided optimizations. Because they assume that running time depends linearly on input size, state-of-the-art techniques can be seriously harmed by data skewness, load imbalance, and straggling tasks. We thus design a novel profile-guided progress indicator, called NearestFit, that operates without the linearity assumption and in a fully online way (i.e., without resorting to profile data collected from previous executions). NearestFit exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. The fine-grained profiles required by our theoretical progress model are approximated through space- and time-efficient data streaming algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive empirical assessment on the Amazon EC2 platform over a variety of benchmarks shows that NearestFit remains accurate even when competitors incur non-negligible errors and wide prediction fluctuations.
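To make the contrast with the linear assumption concrete, the following is a minimal sketch, not the authors' NearestFit implementation. It compares a predictor that assumes running time is proportional to input size with a plain k-nearest-neighbor regression over (input size, running time) pairs observed so far in the same job. All function names, the toy data, and the assumption that tasks run one after another are hypothetical and chosen only for illustration. The curve fitting and data streaming components mentioned in the abstract are not modeled.

```python
def linear_estimate(observed, pending_sizes):
    """Predict pending task times assuming time is proportional to input size."""
    total_size = sum(size for size, _ in observed)
    total_time = sum(time for _, time in observed)
    rate = total_time / total_size  # average seconds per unit of input
    return [rate * size for size in pending_sizes]


def nearest_neighbor_estimate(observed, pending_sizes, k=1):
    """Predict each pending task's time as the mean time of the k observed
    tasks whose input size is closest to it."""
    predictions = []
    for size in pending_sizes:
        nearest = sorted(observed, key=lambda pair: abs(pair[0] - size))[:k]
        predictions.append(sum(time for _, time in nearest) / len(nearest))
    return predictions


def progress(elapsed, predicted_remaining):
    """Fraction of total computation time already completed."""
    return elapsed / (elapsed + predicted_remaining)


# Hypothetical toy job: running time grows quadratically with input size,
# and one skewed task (size 8) dominates the time observed so far.
observed = [(1, 1.0), (1, 1.0), (2, 4.0), (2, 4.0), (8, 64.0)]
pending_sizes = [1, 1, 1, 1]  # four small tasks still to run

# Assumption: tasks run sequentially, so elapsed time is the sum of observed times.
elapsed = sum(time for _, time in observed)

linear_remaining = sum(linear_estimate(observed, pending_sizes))
nn_remaining = sum(nearest_neighbor_estimate(observed, pending_sizes, k=1))

print("linear progress:", progress(elapsed, linear_remaining))
print("nearest-neighbor progress:", progress(elapsed, nn_remaining))
```

In this toy setting the skewed task inflates the average rate, so the linear predictor overestimates the remaining work and reports lower progress than the nearest-neighbor predictor, which reuses the times of tasks with similar input sizes. When no observed task has a similar size, nearest neighbors alone give little guidance, which is presumably where the curve fitting named in the abstract comes in.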
