Performance Improvement of MapReduce Process by Promoting Deep Data Locality

Sungchul Lee, Ju-Yeon Jo, Yoohwan Kim
{"title":"Performance Improvement of MapReduce Process by Promoting Deep Data Locality","authors":"Sungchul Lee, Ju-Yeon Jo, Yoohwan Kim","doi":"10.1109/DSAA.2016.38","DOIUrl":null,"url":null,"abstract":"MapReduce has been widely used in many data science applications. It has been observed that an excessive data transfer has a negative impact on its performance. To reduce the amount of data transfer, MapReduce utilizes data locality. However, even though the majority of the processing cost occurs in the later stages, data locality has been utilized only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We have explored a new concept called Deep Data Locality (DDL) where the data is pre-arranged to maximize the locality in the later stages. Toward achieving stronger DDL, we introduce a new block placement paradigm called Limited Node Block Placement Policy (LNBPP). Under the conventional default block placement policy (DBPP), data blocks are randomly placed on any available slave nodes, requiring a copy of RLM (Rack-Local Map) blocks. On the other hand, LNBPP places the blocks in a way to avoid RLMs, reducing the block copying time. The containers without RLM have a more consistent execution time, and when assigned to individual cores on a multicore node, they finish a job faster collectively than the containers under DBPP. LNBPP also rearranges the blocks into a smaller number of nodes (hence Limited Node) and reduces the data transfer time between nodes. These strategies bring a significant performance improvement in Map and Shuffle. Our test result shows that the execution times of Map and Shuffle have been improved by up to 33% and 44% respectively. In this paper, we describe the MapReduce workflow in Hadoop with a simple computational model and introduce the current research directions in each step. We analyze the block placement status and RLM locations in DBPP with the customer review data from TripAdvisor and measure the performances by executing the Terasort Benchmark with various sizes of data. We then compare the performances of LNBPP with DBPP.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSAA.2016.38","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

Abstract

MapReduce has been widely used in many data science applications. It has been observed that excessive data transfer has a negative impact on its performance. To reduce the amount of data transfer, MapReduce utilizes data locality. However, even though the majority of the processing cost occurs in the later stages, data locality has been utilized only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We have explored a new concept called Deep Data Locality (DDL), where the data is pre-arranged to maximize locality in the later stages. Toward achieving stronger DDL, we introduce a new block placement paradigm called the Limited Node Block Placement Policy (LNBPP). Under the conventional default block placement policy (DBPP), data blocks are randomly placed on any available slave nodes, requiring copies of Rack-Local Map (RLM) blocks. In contrast, LNBPP places the blocks so as to avoid RLMs, reducing the block copying time. Containers without RLMs have more consistent execution times, and when assigned to individual cores on a multicore node, they collectively finish a job faster than containers under DBPP. LNBPP also rearranges the blocks onto a smaller number of nodes (hence Limited Node), reducing the data transfer time between nodes. These strategies bring a significant performance improvement in Map and Shuffle. Our test results show that the execution times of Map and Shuffle improve by up to 33% and 44%, respectively. In this paper, we describe the MapReduce workflow in Hadoop with a simple computational model and introduce the current research directions in each step. We analyze the block placement status and RLM locations under DBPP with the customer review data from TripAdvisor, and we measure performance by executing the TeraSort benchmark with various sizes of data. We then compare the performance of LNBPP with that of DBPP.
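To make the placement idea concrete, the following Python sketch simulates the two policies. It is not the paper's Hadoop implementation (LNBPP is realized inside HDFS's block placement machinery); the node names, cluster sizes, replica counts, and the greedy scheduler below are all illustrative assumptions. The sketch counts how many map tasks become RLMs when blocks are placed randomly (DBPP) versus dealt round-robin onto a limited node set (LNBPP).

import random

# Toy simulation (hypothetical, not the paper's Hadoop code): compare the
# number of Rack-Local Maps (RLMs) produced by random block placement
# (DBPP) versus placement confined to a limited node set (LNBPP).

NODES = [f"node{i:02d}" for i in range(12)]   # slave nodes (assumed)
BLOCKS = 96                                   # blocks of one input file
REPLICAS = 3                                  # HDFS replication factor

def dbpp():
    """DBPP: each block's replicas land on random distinct nodes."""
    return [random.sample(NODES, REPLICAS) for _ in range(BLOCKS)]

def lnbpp(limited_count=8):
    """LNBPP sketch: deal blocks round-robin onto a limited subset of
    nodes so every node holds an even, predictable share of replicas."""
    limited = NODES[:limited_count]
    return [[limited[(b + r) % limited_count] for r in range(REPLICAS)]
            for b in range(BLOCKS)]

def count_rlm(placement):
    """Greedy scheduler sketch: each node used by the placement gets an
    equal number of map slots. A task is node-local if some replica of
    its block sits on a node with a free slot; otherwise it runs as an
    RLM and its block must first be copied over the network."""
    nodes = sorted({n for replicas in placement for n in replicas})
    cap = -(-len(placement) // len(nodes))    # ceil: map slots per node
    free = {n: cap for n in nodes}
    rlm = 0
    for replicas in placement:
        local = next((n for n in replicas if free[n] > 0), None)
        if local:
            free[local] -= 1                  # node-local map task
        else:
            spill = next(n for n in nodes if free[n] > 0)
            free[spill] -= 1                  # rack-local map: copy needed
            rlm += 1
    return rlm

random.seed(1)
print("DBPP  RLMs:", count_rlm(dbpp()))       # typically a handful
print("LNBPP RLMs:", count_rlm(lnbpp()))      # 0 by construction

Under the round-robin placement, each limited node is the primary holder of exactly its share of blocks, so the greedy scheduler always finds a node-local slot and no block copies are needed; random placement, by contrast, routinely leaves a few tasks with no local replica available.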