Enhancing MapReduce via Asynchronous Data Processing

2010 IEEE 16th International Conference on Parallel and Distributed Systems. Pub Date: 2010-12-08. DOI: 10.1109/ICPADS.2010.116

M. Elteir, Heshan Lin, Wu-chun Feng


Citations: 50

Abstract

The MapReduce programming model simplifies large-scale data processing on commodity clusters by having users specify a map function that processes input key/value pairs to generate intermediate key/value pairs, and a reduce function that merges and converts intermediate key/value pairs into final results. Typical MapReduce implementations such as Hadoop enforce barrier synchronization between the map and reduce phases, i.e., the reduce phase does not start until all map tasks are finished. This synchronization requirement can cause inefficient utilization of computing resources and can adversely impact performance. We therefore present and evaluate two approaches to cope with the synchronization drawback of existing MapReduce implementations. The first approach, hierarchical reduction, starts a reduce task as soon as a predefined number of map tasks complete; it then aggregates the results of the different reduce tasks following a tree structure. The second approach, incremental reduction, starts a predefined number of reduce tasks from the beginning and has each reduce task incrementally reduce records collected from map tasks. Together with our performance modeling, we evaluate the different reduction approaches with two real applications on a 32-node cluster. The experimental results show that incremental reduction outperforms hierarchical reduction in general. Also, incremental reduction can speed up the original Hadoop implementation by up to 35.33% for the word count application and 57.98% for the grep application. In addition, incremental reduction outperforms the original Hadoop in an emulated cloud environment with heterogeneous compute nodes.
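The difference between the barrier-based baseline and the two approaches named in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation and does not use the Hadoop API: the function names (map_task, merge, barrier_reduce, incremental_reduce, hierarchical_reduce) and the fan_in parameter are hypothetical, word count is used as the example application, and the sketch assumes the reduce operation is associative and commutative so that partial results can be merged in any order.

```python
from collections import Counter
from functools import reduce


def map_task(chunk):
    """Map: produce partial word counts for one input split."""
    return Counter(chunk.split())


def merge(a, b):
    """Reduce step: merge two partial word-count results."""
    return a + b


def barrier_reduce(map_outputs):
    """Baseline: collect every map output first, then reduce once."""
    outputs = list(map_outputs)  # the barrier: nothing is reduced until all maps finish
    return reduce(merge, outputs, Counter())


def incremental_reduce(map_outputs):
    """Incremental: fold each map output into a running result as it arrives."""
    running = Counter()
    for out in map_outputs:
        running = merge(running, out)
    return running


def hierarchical_reduce(map_outputs, fan_in=2):
    """Hierarchical: reduce each group of fan_in map outputs as soon as the
    group is complete, then merge the partial results level by level (a tree)."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level, batch = [], []
    for out in map_outputs:
        batch.append(out)
        if len(batch) == fan_in:  # enough maps finished: start a reduce now
            level.append(reduce(merge, batch, Counter()))
            batch = []
    if batch:  # leftover map outputs that did not fill a group
        level.append(reduce(merge, batch, Counter()))
    while len(level) > 1:  # aggregate the partial results up the tree
        level = [
            reduce(merge, level[i:i + fan_in], Counter())
            for i in range(0, len(level), fan_in)
        ]
    return level[0] if level else Counter()


if __name__ == "__main__":
    splits = ["a b a", "b c", "a c c", "b"]
    for strategy in (barrier_reduce, incremental_reduce, hierarchical_reduce):
        print(strategy.__name__, dict(strategy(map(map_task, splits))))
```

All three strategies return the same counts; what differs is when reduce work can begin. In this sketch the map outputs are consumed lazily, so the incremental and hierarchical variants start merging before the last map task has produced its output, which is the overlap the abstract describes. It says nothing about the relative performance reported in the paper.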


Source journal

2010 IEEE 16th International Conference on Parallel and Distributed Systems

Self-citation rate

0.00%

Publication count