A Balanced Partitioning Mechanism Using Collapsed-Condensed Trie in MapReduce

Hsing-Lung Chen, Syu-Huan Chen
DOI: 10.1109/SC2.2018.00020
Published in: 2018 IEEE 8th International Symposium on Cloud and Service Computing (SC2), November 2018
Citations: 1

Abstract

MapReduce has emerged as an efficient platform for processing big data. It decouples the data and distributes the workload across multiple reducers, which process it in a fully parallel manner. Zipf's law asserts that, for many types of data studied in the physical and social sciences, the frequency of any event is inversely proportional to its rank in the frequency table; that is, the key distribution is skewed. However, MapReduce's hash partitioner usually assigns unbalanced workloads to the reducers when the data are skewed. This imbalance degrades MapReduce's performance significantly, because the overall running time of a map-reduce cycle is determined by the longest-running reducer. It is therefore important to develop a balanced partitioning algorithm that divides the workload evenly among all reducers. This paper proposes a balanced partitioning mechanism in MapReduce based on a collapsed-condensed trie, which distributes the workload evenly across the reducers. The collapsed-condensed trie is introduced to capture the data statistics faithfully, requiring a reasonable amount of memory and incurring a small running overhead. We then propose a quasi-optimal packing algorithm that assigns sub-partitions to the reducers evenly, reducing the total execution time. Experiments using inverted indexing on real-world datasets are conducted to evaluate the performance of the proposed partitioning mechanism.
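The skew problem the abstract describes can be illustrated with a small self-contained sketch (illustrative only, not code from the paper): keys drawn from a Zipf-like distribution are partitioned by `hash(key) mod R`, the default MapReduce-style strategy, and the busiest reducer ends up well above the average load. The key counts, exponent, and reducer count below are arbitrary choices for the demonstration.

```python
import random

def zipf_keys(n_keys, n_distinct, s=1.2, seed=42):
    # Draw keys whose frequencies follow a Zipf-like power law:
    # the key of rank r has weight 1 / r**s, so a few keys dominate.
    rng = random.Random(seed)
    weights = [1.0 / (r ** s) for r in range(1, n_distinct + 1)]
    return rng.choices(range(n_distinct), weights=weights, k=n_keys)

def hash_partition_loads(keys, n_reducers):
    # Default MapReduce-style partitioning: hash(key) mod R.
    loads = [0] * n_reducers
    for k in keys:
        loads[hash(k) % n_reducers] += 1
    return loads

keys = zipf_keys(100_000, 1_000)
loads = hash_partition_loads(keys, 8)
imbalance = max(loads) / (sum(loads) / len(loads))
print(f"per-reducer loads: {loads}")
print(f"max/avg imbalance: {imbalance:.2f}")  # noticeably above 1.0
```

Since the map-reduce cycle finishes only when the slowest reducer does, a max/avg ratio of 2 roughly doubles the reduce phase compared with a perfectly balanced assignment, which is the motivation for statistics-driven partitioning.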
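The abstract does not spell out the quasi-optimal packing algorithm, so as a plausible sketch of the general idea, here is a classic LPT-style greedy heuristic: sub-partitions are assigned largest-first, each to the currently least-loaded reducer. The function name and example sizes are hypothetical, not taken from the paper.

```python
import heapq

def greedy_pack(subpartition_sizes, n_reducers):
    # LPT-style greedy packing: process sub-partitions in decreasing
    # size order, assigning each to the currently least-loaded reducer.
    # A generic balancing heuristic, not necessarily the paper's exact
    # packing algorithm.
    heap = [(0, r) for r in range(n_reducers)]  # (load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for idx, size in sorted(enumerate(subpartition_sizes),
                            key=lambda t: t[1], reverse=True):
        load, r = heapq.heappop(heap)
        assignment[idx] = r
        heapq.heappush(heap, (load + size, r))
    return assignment

sizes = [50, 30, 20, 20, 10, 10, 5, 5]  # per-sub-partition record counts
assign = greedy_pack(sizes, 3)
loads = [0] * 3
for i, r in assign.items():
    loads[r] += sizes[i]
print(loads)  # → [50, 50, 50]
```

On this example the heuristic happens to reach a perfectly even split; in general LPT is only near-optimal, which matches the "quasi-optimal" qualifier in the abstract. The accuracy of the sub-partition sizes fed into such a packer is exactly what the collapsed-condensed trie's statistics are meant to provide.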