Sampling-Based Partitioning in MapReduce for Skewed Data
Yujie Xu, Peng Zou, W. Qu, Zhiyang Li, Keqiu Li, Xiaoli Cui
2012 Seventh ChinaGrid Annual Conference, September 2012. DOI: 10.1109/ChinaGrid.2012.18
Citations: 32
Abstract
MapReduce, a popular tool for distributed and scalable processing of voluminous data, has been used in many areas. However, it is not efficient when handling skewed data, since it considers only the key and uses a uniform hash to distribute the workload across reducers, ignoring the key distribution. This can lead to load imbalance, increase processing time, produce "stragglers", and ultimately degrade performance. Current approaches to this problem usually run Map and Reduce asynchronously to gather the distribution of key frequencies and build a partition scheme in advance, but this incurs excessive waiting time. In this paper, we address the problem of how to efficiently and effectively partition the intermediate keys so as to balance the load across reducers in the presence of skewed data. We use a sampling MapReduce job to gather the distribution of key frequencies, estimate the overall distribution, and build a partition scheme in advance. We then apply this scheme in the map phase of the intended MapReduce job. This design not only provides a load-balanced partition scheme but also retains the high performance of MapReduce's synchronous mode. We also propose two partition schemes based on the sampling results: cluster combination optimization and cluster partition combination. The experimental results show that the first scheme is suitable for data sets with lighter skew, while cluster partition combination offers greater advantages in running time and load balancing when the data skew is heavier.
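To make the idea concrete, the sketch below shows one plausible way to wire a sampling-derived partition plan into Hadoop: a greedy plan builder assigns the heaviest sampled keys to the currently least-loaded reducer, and a custom Partitioner consults that plan at map time, falling back to the default hash rule for keys the sample never saw. This is a minimal illustration, not the paper's published code; the class name SkewAwarePartitioner, the configuration property "skew.partition.plan", and the greedy heuristic standing in for the two cluster-combination schemes are all assumptions.

```java
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.HashMap;
import java.util.Map;

// Hypothetical skew-aware partitioner driven by a plan computed from a
// prior sampling job. Names and config keys are illustrative assumptions.
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable>
        implements Configurable {

    private Configuration conf;
    // key -> reducer index, produced offline by the sampling job
    private final Map<String, Integer> plan = new HashMap<>();

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // Assume the sampling job serialized its plan as "key0=r0,key1=r1,...".
        String raw = conf.get("skew.partition.plan", "");
        for (String entry : raw.split(",")) {
            if (entry.isEmpty()) continue;
            String[] kv = entry.split("=");
            plan.put(kv[0], Integer.parseInt(kv[1]));
        }
    }

    @Override
    public Configuration getConf() { return conf; }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Keys seen in the sample follow the precomputed plan; unseen keys
        // (necessarily rare, hence harmless) use the default hash rule.
        Integer target = plan.get(key.toString());
        if (target != null) return target % numPartitions;
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /**
     * Greedy (longest-processing-time-first) plan: each sampled key, taken
     * heaviest first, goes to the currently least-loaded reducer. A simple
     * stand-in for the paper's cluster-combination schemes.
     */
    public static Map<String, Integer> buildPlan(Map<String, Long> sampledFreq,
                                                 int numReducers) {
        long[] load = new long[numReducers];
        Map<String, Integer> result = new HashMap<>();
        sampledFreq.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                .forEach(e -> {
                    int min = 0;
                    for (int r = 1; r < numReducers; r++) {
                        if (load[r] < load[min]) min = r;
                    }
                    result.put(e.getKey(), min);
                    load[min] += e.getValue();
                });
        return result;
    }
}
```

Because the sampling job only needs to capture the heavy hitters that cause skew, the fallback hash path for unsampled keys keeps the partitioner safe while the plan absorbs the imbalance; this preserves the synchronous execution mode, since the plan is fixed before the main job's map phase starts.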