Improved Intermediate Data Management for MapReduce Frameworks

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2020-05-01. DOI: 10.1109/IPDPS47924.2020.00062

Haoyu Wang, Haiying Shen, Charles Reiss, A. Jain, Yunqiao Zhang

Pages: 536-545

Citations: 4

Abstract

MapReduce is a popular distributed framework for big data analysis. However, the current MapReduce framework handles intermediate data inefficiently, which may cause bottlenecks in I/O operations, computation, and network bandwidth. Previous work addresses the I/O problem by aggregating the map task outputs (i.e., intermediate data) for each reduce task on one machine. Unfortunately, when there are many reduce tasks, their concurrent requests for intermediate data generate a large number of I/O operations. In this paper, we present APA (Aggregation, Partition, and Allocation), a new intermediate data management system for the MapReduce framework. APA aggregates the intermediate data from the map tasks in each rack into one file, and the file host pushes the needed intermediate data to each reduce task. This reduces the number of disk seeks involved in handling intermediate data within one job. Rather than evenly distributing the intermediate data among reduce tasks based on the keys, as current MapReduce does, APA partitions the intermediate data to balance the execution latency of different reduce tasks. APA further decides where to allocate each reduce task to minimize the intermediate data transmission time between map tasks and reduce tasks. Through experiments on a real Hadoop MapReduce cluster using the HiBench benchmark suite, we show that APA improves the performance of current Hadoop by 40%-50%.
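The partitioning idea in the abstract can be illustrated with a small sketch. It is not APA's actual algorithm, which the abstract does not specify. It only contrasts key-hash partitioning with a generic greedy heuristic (largest group first, assigned to the least-loaded reducer). The function names, the sample key sizes, and the use of data size as a stand-in for reduce-task latency are all assumptions made for illustration.

```python
import heapq
import zlib
from typing import Dict, List


def hash_partition(key_sizes: Dict[str, int], num_reducers: int) -> List[int]:
    """Per-reducer load when each key is assigned by a hash of the key alone."""
    loads = [0] * num_reducers
    for key, size in key_sizes.items():
        # crc32 is used instead of hash() so the result is stable across runs.
        loads[zlib.crc32(key.encode()) % num_reducers] += size
    return loads


def balanced_partition(key_sizes: Dict[str, int], num_reducers: int) -> List[int]:
    """Per-reducer load under a greedy heuristic (hypothetical, not APA itself):
    take key groups from largest to smallest and give each one to the
    reducer that currently has the least data."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    loads = [0] * num_reducers
    for _key, size in sorted(key_sizes.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        loads[r] = load + size
        heapq.heappush(heap, (loads[r], r))
    return loads


# Made-up intermediate data sizes per key; size is only a proxy for latency.
key_sizes = {"a": 900, "b": 300, "c": 300, "d": 300, "e": 100, "f": 100}
hashed = hash_partition(key_sizes, 3)
balanced = balanced_partition(key_sizes, 3)
print("hash partition loads:    ", hashed, "max:", max(hashed))
print("balanced partition loads:", balanced, "max:", max(balanced))
```

The slowest reduce task bounds job completion time, so the quantity to compare is the maximum per-reducer load. A real system would also need a latency model and the placement step described in the abstract, neither of which this sketch attempts.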

