Haoyu Wang, Haiying Shen, Charles Reiss, A. Jain, Yunqiao Zhang
Improved Intermediate Data Management for MapReduce Frameworks
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 536-545
Publication date: 2020-05-01
DOI: 10.1109/IPDPS47924.2020.00062 (https://doi.org/10.1109/IPDPS47924.2020.00062)
Citations: 4
Abstract
MapReduce is a popular distributed framework for big data analysis. However, the current MapReduce framework handles intermediate data inefficiently, which may cause bottlenecks in I/O operations, computation, and network bandwidth. Previous work addresses the I/O problem by aggregating the map task outputs (i.e., the intermediate data) for each reduce task on one machine. Unfortunately, when there are many reduce tasks, their concurrent requests for intermediate data generate a large number of I/O operations. In this paper, we present APA (Aggregation, Partition, and Allocation), a new intermediate data management system for the MapReduce framework. APA aggregates the intermediate data from the map tasks in each rack into one file, and the file host pushes the needed intermediate data to each reduce task. This reduces the number of disk seeks involved in handling intermediate data within one job. Rather than distributing the intermediate data evenly among reduce tasks based on the keys, as current MapReduce does, APA partitions the intermediate data to balance the execution latency of the reduce tasks. APA further decides where to place each reduce task so as to minimize the intermediate data transmission time between map tasks and reduce tasks. Through experiments on a real Hadoop MapReduce cluster using the HiBench benchmark suite, we show that APA improves the performance of current Hadoop by 40%-50%.
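The abstract does not give APA's partitioning algorithm, so the following is only a minimal sketch of the general idea it contrasts: assigning keys to reduce tasks by hash versus assigning them so that per-task load is balanced. It assumes that the data volume per key is known and that volume is a reasonable proxy for reduce latency. The balanced variant uses a standard greedy longest-processing-time heuristic, swapped in for illustration; the function names and the sample sizes are hypothetical.

```python
import heapq
from zlib import crc32


def hash_partition(key_sizes, num_reducers):
    """Baseline: pick a reducer per key by hash, ignoring data volume."""
    loads = [0] * num_reducers
    for key, size in key_sizes.items():
        loads[crc32(key.encode()) % num_reducers] += size
    return loads


def balanced_partition(key_sizes, num_reducers):
    """Greedy heuristic (not APA's method): give the largest remaining
    key group to the reducer that currently has the smallest load."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for key, size in sorted(key_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + size, reducer))
    loads = [0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += key_sizes[key]
    return assignment, loads


# Hypothetical per-key intermediate data sizes (arbitrary units).
sample_sizes = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
sample_assignment, sample_loads = balanced_partition(sample_sizes, 2)
```

With these sample sizes, the greedy assignment yields loads of 17 and 13 on the two reducers. A hash-based split of the same keys depends only on the key names, so it can be arbitrarily skewed when a few keys carry most of the data.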