To Overlap or Not to Overlap: Optimizing Incremental MapReduce Computations for On-Demand Data Upload

2014 5th International Workshop on Data-Intensive Computing in the Clouds | Pub Date: 2014-11-16 | DOI: 10.1109/DataCloud.2014.7

Stefan Ene, Bogdan Nicolae, Alexandru Costan, Gabriel Antoniu


Citations: 5

Abstract

Research on cloud-based Big Data analytics has so far focused on optimizing the performance and cost-effectiveness of the computations, while largely neglecting an important aspect: users need to upload massive datasets to the cloud before those computations can run. This paper studies the problem of running MapReduce applications when the performance and cost of the data upload and of the corresponding computation are optimized together. We analyze the feasibility of incremental MapReduce approaches that advance the computation as much as possible during the data upload by using already transferred data to calculate intermediate results. Our key finding is that overlapping the transfer time with as many incremental computations as possible is not always efficient: a better solution is to wait until enough data has arrived to fill the computational capacity of the MapReduce cluster. Results show significant performance and cost reductions compared with state-of-the-art solutions that leverage incremental computations in a naive fashion.
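The trade-off described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under assumptions that are not taken from the paper: chunks arrive at a fixed interval, each incremental job pays a fixed startup overhead, jobs run one at a time, and a job processes its chunks in waves of `slots` parallel map tasks. The function name `simulate` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math


def simulate(n_chunks, upload_interval, slots, job_overhead, task_time, min_batch):
    """Toy model of incremental jobs overlapped with a data upload.

    min_batch=1 is the eager policy (launch as soon as any chunk is buffered);
    min_batch=slots waits until the buffered data can fill the cluster.
    Returns (makespan, jobs_launched, cluster_busy_time).
    """
    # Assumption: chunk i finishes uploading at (i + 1) * upload_interval.
    arrivals = [(i + 1) * upload_interval for i in range(n_chunks)]
    now = 0.0
    done = 0      # chunks already processed
    jobs = 0
    busy = 0.0    # total time the cluster spends running jobs

    while done < n_chunks:
        # Chunks required before launching; the tail of the upload may be smaller.
        need = min(min_batch, n_chunks - done)
        # Wait (if necessary) until that many unprocessed chunks have arrived.
        now = max(now, arrivals[done + need - 1])
        # The job takes every chunk that has arrived by launch time.
        arrived = sum(1 for a in arrivals if a <= now)
        batch = arrived - done
        # Fixed overhead plus one task_time per wave of `slots` parallel tasks.
        duration = job_overhead + math.ceil(batch / slots) * task_time
        now += duration
        busy += duration
        done = arrived
        jobs += 1

    return now, jobs, busy


if __name__ == "__main__":
    common = dict(n_chunks=8, upload_interval=1, slots=4, job_overhead=2, task_time=1)
    print("eager:", simulate(min_batch=1, **common))
    print("fill: ", simulate(min_batch=4, **common))
```

With these illustrative parameters, the eager policy launches more jobs, each of which leaves slots idle while still paying the startup overhead, so it finishes later and keeps the cluster busy longer than the capacity-filling policy. The actual gap depends on the overhead, upload rate, and cluster size, none of which are specified in the abstract.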

