Data Pipeline in MapReduce

2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/eScience.2013.21

Jiaan Zeng, Beth Plale


Citations: 4

Abstract

MapReduce is an effective programming model for large-scale text and data analysis. Traditional MapReduce implementations, e.g., Hadoop, require that the entire input dataset be loaded into the cluster before any analysis can take place. This can introduce sizable latency when the data set is large and cannot be loaded once and processed many times, a situation that arises for log files, health records, and protected texts, for instance. We propose a data pipeline approach to hide data upload latency in MapReduce analysis. Our implementation, which is based on Hadoop MapReduce, is completely transparent to the user. It introduces a distributed concurrency queue to coordinate data block allocation and synchronization so as to overlap data upload and execution. The paper addresses two scheduling challenges: scheduling with a fixed number of maps and scheduling with a dynamic number of maps, the latter allowing better handling of input data sets of unknown size. We also employ a delay scheduler to achieve data locality for the data pipeline. An evaluation of the solution on different applications with real-world data sets shows that our approach yields performance gains.
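The overlap of upload and execution described in the abstract can be illustrated with a small single-process sketch. This is not the authors' Hadoop implementation: a thread-safe `queue.Queue` stands in for the distributed concurrency queue, threads stand in for map tasks, and the names `run_pipeline`, `uploader`, and `worker` are hypothetical. It only shows the general idea that map workers start consuming blocks as soon as each one is announced, rather than waiting for the whole input to arrive.

```python
import queue
import threading

_END = object()  # sentinel marking that the upload has finished


def uploader(blocks, block_queue):
    """Announce each block on the queue as soon as it is 'uploaded'."""
    for block in blocks:
        block_queue.put(block)
    block_queue.put(_END)


def run_pipeline(blocks, map_fn, num_workers=2):
    """Apply map_fn to blocks while they are still being uploaded.

    Returns the list of map outputs (order is not guaranteed).
    """
    block_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            block = block_queue.get()
            if block is _END:
                # Put the sentinel back so the other workers also stop.
                block_queue.put(_END)
                return
            output = map_fn(block)
            with results_lock:
                results.append(output)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()

    upload_thread = threading.Thread(target=uploader, args=(blocks, block_queue))
    upload_thread.start()

    upload_thread.join()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    # Word count over three toy "blocks"; the reduce step is a plain sum.
    counts = run_pipeline(["a b", "c d e", "f"], lambda b: len(b.split()))
    print(sum(counts))
```

Because workers pull from the queue until they see the sentinel, the number of blocks does not need to be known in advance, which loosely mirrors the unknown-input-size case mentioned in the abstract. Data locality and delay scheduling are outside the scope of this sketch.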
