{"title":"MapReduce中基于折叠压缩树的均衡分区机制","authors":"Hsing-Lung Chen, Syu-Huan Chen","doi":"10.1109/SC2.2018.00020","DOIUrl":null,"url":null,"abstract":"The MapReduce has emerged as an efficient platform for coping with big data. It achieves this goal by decoupling the data and then distributing the workloads to multiple reducers for processing in a fully parallel manner. Zipf's law asserts that, for many types of data studied in the physical and social sciences, the frequency of any event is inversely proportional to its rank in the frequency table, i.e. the key distribution is skewed. However, the hash function of MapReduce usually generates the unbalanced workloads to multiple reducers for the skewed data. The unbalanced workloads to multiple reducers lead to degrading the performance of MapReduce significantly, because the overall running time of a map-reduce cycle is determined by the longest running reducer. Thus, it is an important issue to develop a balanced partitioning algorithm which partitions the workloads evenly for all the reducers. This paper proposes a balanced partitioning mechanism with collapsed-condensed trie in MapReduce, which evenly distributes the workloads to the reducers. A collapsed-condensed trie is introduced for capturing the data statistics authentically, with which it requires a reasonable amount of memory usage and incurs a small running overhead. Then, we propose a quasi-optimal packing algorithm to assign sub-partitions to the reducers evenly, resulting in reducing the total execution time. 
The experiments using Inverted Indexing on the real-world datasets are conducted to evaluate the performance of our proposed partitioning mechanism.","PeriodicalId":340244,"journal":{"name":"2018 IEEE 8th International Symposium on Cloud and Service Computing (SC2)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A Balanced Partitioning Mechanism Using Collapsed-Condensed Trie in MapReduce\",\"authors\":\"Hsing-Lung Chen, Syu-Huan Chen\",\"doi\":\"10.1109/SC2.2018.00020\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The MapReduce has emerged as an efficient platform for coping with big data. It achieves this goal by decoupling the data and then distributing the workloads to multiple reducers for processing in a fully parallel manner. Zipf's law asserts that, for many types of data studied in the physical and social sciences, the frequency of any event is inversely proportional to its rank in the frequency table, i.e. the key distribution is skewed. However, the hash function of MapReduce usually generates the unbalanced workloads to multiple reducers for the skewed data. The unbalanced workloads to multiple reducers lead to degrading the performance of MapReduce significantly, because the overall running time of a map-reduce cycle is determined by the longest running reducer. Thus, it is an important issue to develop a balanced partitioning algorithm which partitions the workloads evenly for all the reducers. This paper proposes a balanced partitioning mechanism with collapsed-condensed trie in MapReduce, which evenly distributes the workloads to the reducers. A collapsed-condensed trie is introduced for capturing the data statistics authentically, with which it requires a reasonable amount of memory usage and incurs a small running overhead. 
Then, we propose a quasi-optimal packing algorithm to assign sub-partitions to the reducers evenly, resulting in reducing the total execution time. The experiments using Inverted Indexing on the real-world datasets are conducted to evaluate the performance of our proposed partitioning mechanism.\",\"PeriodicalId\":340244,\"journal\":{\"name\":\"2018 IEEE 8th International Symposium on Cloud and Service Computing (SC2)\",\"volume\":\"72 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE 8th International Symposium on Cloud and Service Computing (SC2)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SC2.2018.00020\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 8th International Symposium on Cloud and Service Computing (SC2)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC2.2018.00020","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Balanced Partitioning Mechanism Using Collapsed-Condensed Trie in MapReduce
MapReduce has emerged as an efficient platform for coping with big data. It achieves this by decoupling the data and distributing the workload to multiple reducers, which process it in a fully parallel manner. Zipf's law asserts that, for many types of data studied in the physical and social sciences, the frequency of any event is inversely proportional to its rank in the frequency table; that is, the key distribution is skewed. For such skewed data, however, the default hash function of MapReduce usually assigns unbalanced workloads to the reducers. These unbalanced workloads degrade the performance of MapReduce significantly, because the overall running time of a map-reduce cycle is determined by the longest-running reducer. It is therefore important to develop a balanced partitioning algorithm that distributes the workload evenly over all reducers. This paper proposes a balanced partitioning mechanism with a collapsed-condensed trie in MapReduce, which distributes the workload evenly to the reducers. The collapsed-condensed trie is introduced to capture the data statistics faithfully while requiring a reasonable amount of memory and incurring only a small running overhead. We then propose a quasi-optimal packing algorithm that assigns sub-partitions to the reducers evenly, reducing the total execution time. Experiments using inverted indexing on real-world datasets evaluate the performance of the proposed partitioning mechanism.
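As a rough illustration of the statistics structure the abstract mentions, the following is a minimal path-compressed (condensed) trie that counts key frequencies by shared prefix. This is a sketch under stated assumptions, not the paper's data structure: the "collapsing" policy that bounds memory by merging cold subtrees is omitted, and the class and method names are invented for this example.

```python
class CondensedTrie:
    """Minimal condensed (path-compressed) trie counting key frequencies.

    Each node is a dict mapping an edge label (a string, possibly several
    characters long) to an entry [count, child_dict]. This is only a
    sketch of the counting side; the paper's subtree-collapsing step for
    bounding memory is not reproduced here.
    """

    def __init__(self):
        self.root = {}  # edge label -> [count, child dict]

    def insert(self, key):
        node = self.root
        while True:
            # Look for an existing edge sharing a prefix with the key.
            # At most one edge can match, since edge labels out of one
            # node start with distinct characters.
            for label in list(node):
                common = 0
                while (common < len(label) and common < len(key)
                       and label[common] == key[common]):
                    common += 1
                if common == 0:
                    continue
                entry = node.pop(label)
                if common < len(label):
                    # Split the edge at the shared prefix.
                    entry = [0, {label[common:]: entry}]
                node[label[:common]] = entry
                if common == len(key):
                    entry[0] += 1        # key ends exactly here
                    return
                node, key = entry[1], key[common:]
                break
            else:
                node[key] = [1, {}]      # no shared prefix: new edge
                return

    def items(self, node=None, prefix=""):
        """Yield (key, count) pairs for all keys with nonzero count."""
        node = self.root if node is None else node
        for label, (count, child) in node.items():
            if count:
                yield prefix + label, count
            yield from self.items(child, prefix + label)


# Toy usage: shared prefixes are stored once, counts kept per key.
trie = CondensedTrie()
for k in ["map", "map", "mapper", "reduce"]:
    trie.insert(k)
print(dict(trie.items()))
```

In a map-side sampling pass, such a structure lets each mapper summarize its key distribution compactly; the per-key counts then feed the partition-assignment step.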
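The imbalance described above, and the packing remedy, can be sketched numerically. The toy below is illustrative only, not the paper's algorithm: the Zipf-like key counts are synthetic, and the quasi-optimal packing is approximated by a simple greedy rule (assign the heaviest key group to the currently least-loaded reducer, in the style of longest-processing-time scheduling).

```python
import hashlib

NUM_REDUCERS = 4

# Synthetic Zipf-like key frequencies: the rank-r key occurs ~C/r times.
keys = [f"key{r}" for r in range(1, 21)]
counts = {k: 10000 // r for r, k in enumerate(keys, start=1)}

# Hash partitioning (as in MapReduce's default partitioner):
# reducer = hash(key) mod NUM_REDUCERS, oblivious to key frequency.
hash_load = [0] * NUM_REDUCERS
for k, c in counts.items():
    r = int(hashlib.md5(k.encode()).hexdigest(), 16) % NUM_REDUCERS
    hash_load[r] += c

# Greedy packing: sort key groups by weight, place each on the
# least-loaded reducer so far (a stand-in for quasi-optimal packing).
greedy_load = [0] * NUM_REDUCERS
for k, c in sorted(counts.items(), key=lambda kv: -kv[1]):
    r = min(range(NUM_REDUCERS), key=lambda i: greedy_load[i])
    greedy_load[r] += c

print("hash   loads:", hash_load, " max =", max(hash_load))
print("greedy loads:", greedy_load, " max =", max(greedy_load))
```

Because the map-reduce cycle finishes only when the longest-running reducer does, the quantity to minimize is the maximum load; note that the heaviest single key already lower-bounds it, which is why the paper works at the finer granularity of sub-partitions.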