Sampling-Based Partitioning in MapReduce for Skewed Data
Yujie Xu, Peng Zou, W. Qu, Zhiyang Li, Keqiu Li, Xiaoli Cui
2012 Seventh ChinaGrid Annual Conference, September 2012. DOI: 10.1109/ChinaGrid.2012.18
Citations: 32
Abstract
MapReduce, a popular tool for distributed and scalable processing of voluminous data, has been used in many areas. However, it is not efficient when handling skewed data, since it considers only the key and uses a uniform hash to distribute the workload across reducers, ignoring the key distribution. This can lead to load imbalance, increase processing time, produce "stragglers", and ultimately degrade performance. Current approaches to this problem usually run Map and Reduce asynchronously to gather the distribution of key frequencies and build a partition scheme in advance, but this incurs excessive waiting time. In this paper, we address the problem of how to efficiently and effectively partition the intermediate keys so as to balance the load across reducers in the presence of skewed data. We use a sampling MapReduce job to gather the distribution of key frequencies, estimate the overall distribution, and build a partition scheme in advance. We then apply this scheme in the map phase of the intended MapReduce job. This design not only provides a load-balanced partition scheme but also retains the high performance of MapReduce's synchronous mode. We also propose two partition schemes based on the sampling results: cluster combination optimization and cluster partition combination. The experimental results show that the first scheme is suitable for data sets with lighter skew, while cluster partition combination offers greater advantages in running time and load balancing when the data skew is heavier.
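To make the idea concrete, the sketch below shows one plausible way to wire a sampling-derived partition plan into Hadoop: a greedy plan builder assigns the heaviest sampled keys to the currently least-loaded reducer, and a custom Partitioner consults that plan at map time, falling back to the default hash rule for keys the sample never saw. This is a minimal illustration, not the paper's published code; the class name SkewAwarePartitioner, the configuration property "skew.partition.plan", and the greedy heuristic standing in for the two cluster-combination schemes are all assumptions.

```java
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.HashMap;
import java.util.Map;

// Hypothetical skew-aware partitioner driven by a plan computed from a
// prior sampling job. Names and config keys are illustrative assumptions.
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable>
        implements Configurable {

    private Configuration conf;
    // key -> reducer index, produced offline by the sampling job
    private final Map<String, Integer> plan = new HashMap<>();

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // Assume the sampling job serialized its plan as "key0=r0,key1=r1,...".
        String raw = conf.get("skew.partition.plan", "");
        for (String entry : raw.split(",")) {
            if (entry.isEmpty()) continue;
            String[] kv = entry.split("=");
            plan.put(kv[0], Integer.parseInt(kv[1]));
        }
    }

    @Override
    public Configuration getConf() { return conf; }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Keys seen in the sample follow the precomputed plan; unseen keys
        // (necessarily rare, hence harmless) use the default hash rule.
        Integer target = plan.get(key.toString());
        if (target != null) return target % numPartitions;
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /**
     * Greedy (longest-processing-time-first) plan: each sampled key, taken
     * heaviest first, goes to the currently least-loaded reducer. A simple
     * stand-in for the paper's cluster-combination schemes.
     */
    public static Map<String, Integer> buildPlan(Map<String, Long> sampledFreq,
                                                 int numReducers) {
        long[] load = new long[numReducers];
        Map<String, Integer> result = new HashMap<>();
        sampledFreq.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                .forEach(e -> {
                    int min = 0;
                    for (int r = 1; r < numReducers; r++) {
                        if (load[r] < load[min]) min = r;
                    }
                    result.put(e.getKey(), min);
                    load[min] += e.getValue();
                });
        return result;
    }
}
```

Because the sampling job only needs to capture the heavy hitters that cause skew, the fallback hash path for unsampled keys keeps the partitioner safe while the plan absorbs the imbalance; this preserves the synchronous execution mode, since the plan is fixed before the main job's map phase starts.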