Capacity-aware key partitioning scheme for heterogeneous big data analytic engines

Muhammad Hanif, Choonhwa Lee
{"title":"Capacity-aware key partitioning scheme for heterogeneous big data analytic engines","authors":"Muhammad Hanif, Choonhwa Lee","doi":"10.23919/ICACT.2018.8323921","DOIUrl":null,"url":null,"abstract":"Big data and cloud computing became the centre of interest for the past decade. With the increase of data size and different cloud application, the idea of big data analytics become very popular both in industry and academia. The research communities in industry and academia never stopped trying to come up with the fast, robust, and fault tolerant analytic engines. MapReduce becomes one of the popular big data analytic engine over the past few years. Hadoop is a standard implementation of MapReduce framework for running data-intensive applications on the clusters of commodity servers. By thoroughly studying the framework we find out that the shuffle phase, all-to-all input data fetching phase in reduce task significantly affect the application performance. There is a problem of variance in both the intermediate key's frequencies and their distribution among data nodes throughout the cluster in Hadoop's MapReduce system. This variance in system causes network overhead which leads to unfairness on the reduce input among different data nodes in the cluster. Because of the above problems, applications experience performance degradation due to shuffle phase of MapReduce applications. We develop a new novel algorithm; unlike previous systems our algorithm considers each node's capabilities as heuristics to decide a better available trade-off for the locality and fairness in the system. By comparing with the default Hadoop's partitioning algorithm and Leen partitioning algorithm: a). In case of 2 million key-value pairs to process, on the average our approach achieve better resource utilization by about 19%, and 9%, in that order; b). In case of 3 million key-value pairs to process, our approach achieve near optimal resource utilization by about 15%, and 7%, respectively.","PeriodicalId":228625,"journal":{"name":"2018 20th International Conference on Advanced Communication Technology (ICACT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 20th International Conference on Advanced Communication Technology (ICACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/ICACT.2018.8323921","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Big data and cloud computing have been the centre of interest for the past decade. With growing data sizes and diverse cloud applications, big data analytics has become very popular in both industry and academia. Research communities in industry and academia have never stopped trying to build fast, robust, and fault-tolerant analytic engines. MapReduce has become one of the most popular big data analytic engines over the past few years. Hadoop is a standard implementation of the MapReduce framework for running data-intensive applications on clusters of commodity servers. By thoroughly studying the framework, we find that the shuffle phase, the all-to-all input data fetching phase of the reduce task, significantly affects application performance. Hadoop's MapReduce system suffers from variance both in the frequencies of the intermediate keys and in their distribution among the data nodes of the cluster. This variance causes network overhead, which leads to unfairness in the reduce input across different data nodes in the cluster. Because of these problems, applications experience performance degradation during the shuffle phase of MapReduce jobs. We develop a novel algorithm; unlike previous systems, it uses each node's capabilities as heuristics to decide a better available trade-off between locality and fairness in the system. Compared with Hadoop's default partitioning algorithm and the Leen partitioning algorithm: (a) with 2 million key-value pairs to process, our approach achieves on average better resource utilization by about 19% and 9%, respectively; (b) with 3 million key-value pairs to process, our approach achieves near-optimal resource utilization, improving by about 15% and 7%, respectively.
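The abstract does not spell out the partitioning heuristic itself, so the following is only a minimal illustrative sketch of the general idea of capacity-weighted key partitioning in Hadoop: instead of the default uniform hash partitioner, keys are routed to reducers in proportion to assumed per-node capacity weights. The class name `CapacityAwarePartitioner` and the `CAPACITY_WEIGHTS` array are hypothetical stand-ins, not the paper's actual algorithm.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Illustrative sketch only: assigns keys to reduce partitions in
 * proportion to hypothetical per-node capacity weights, rather than
 * Hadoop's default uniform hash partitioning. The weights and their
 * source are assumptions; the paper's actual heuristic is not shown.
 */
public class CapacityAwarePartitioner extends Partitioner<Text, IntWritable> {

    // Hypothetical relative capacities of the reducer nodes, e.g.
    // obtained by profiling CPU and memory of the heterogeneous cluster.
    private static final double[] CAPACITY_WEIGHTS = {0.5, 0.3, 0.2};

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Map the key's hash onto [0, 1) and walk the cumulative
        // capacity distribution, so higher-capacity nodes receive
        // proportionally more of the intermediate keys.
        double point = (key.hashCode() & Integer.MAX_VALUE)
                / (double) Integer.MAX_VALUE;
        double cumulative = 0.0;
        for (int i = 0; i < numPartitions && i < CAPACITY_WEIGHTS.length; i++) {
            cumulative += CAPACITY_WEIGHTS[i];
            if (point < cumulative) {
                return i;
            }
        }
        return numPartitions - 1; // fall back to the last reducer
    }
}
```

Under these assumptions, such a partitioner would be plugged into a job with the standard Hadoop API call `job.setPartitionerClass(CapacityAwarePartitioner.class)`; the trade-off the paper describes would then lie in how the weights balance data locality against fairness of reduce input sizes.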