Dynamic Data Redistribution for MapReduce Joins

2011 IEEE Third International Conference on Cloud Computing Technology and Science. Pub Date: 2011-11-29. DOI: 10.1109/CloudCom.2011.111

S. Lynden, Y. Tanimura, I. Kojima, Akiyoshi Matono

Citations: 10

Abstract

MapReduce has become a popular method for data processing, particularly for large-scale datasets, because it offers a scalable yet convenient programming paradigm. Data processing tasks often involve joins, and the repartition join and the fragment-replicate join are two widely used join algorithms within the MapReduce framework. This paper presents a multi-join that supports tuple redistribution, building on both the repartition and fragment-replicate joins. Hadoop is used to demonstrate how reduce tasks may improve performance by passing intermediate results to other reduce tasks that are better able to process them, with Apache ZooKeeper serving as the means of communication and data transfer. A performance analysis shows that the technique has the potential to reduce response times when multiple joins are processed in a single MapReduce job.
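
For readers unfamiliar with the two join algorithms named above, the following is a minimal in-memory sketch of each, written in plain Python rather than against Hadoop. It is not the paper's implementation and does not model tuple redistribution or ZooKeeper. The function names, the `num_reducers` parameter, and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict


def repartition_join(left, right, num_reducers=2):
    """Reduce-side join sketch: both inputs are shuffled by join key."""
    # One dict per simulated reducer: key -> (left values, right values).
    partitions = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    # "Map" phase: route every tuple to a reducer by hashing its join key,
    # keeping track of which input it came from.
    for key, value in left:
        partitions[hash(key) % num_reducers][key][0].append(value)
    for key, value in right:
        partitions[hash(key) % num_reducers][key][1].append(value)
    # "Reduce" phase: each reducer joins the tuples that share a key.
    out = []
    for part in partitions:
        for key, (lvals, rvals) in part.items():
            out.extend((key, l, r) for l in lvals for r in rvals)
    return out


def fragment_replicate_join(large_fragments, small):
    """Map-side join sketch: the small input is replicated to every fragment."""
    # Build an in-memory lookup table from the small input.
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    out = []
    # Each fragment stands in for the split handled by one map task.
    for fragment in large_fragments:
        for key, value in fragment:
            out.extend((key, value, s) for s in lookup.get(key, []))
    return out


# Hypothetical sample data: (join key, payload) pairs.
orders = [(1, "o1"), (2, "o2"), (1, "o3")]
users = [(1, "alice"), (2, "bob"), (3, "carol")]

joined_repartition = sorted(repartition_join(orders, users))
joined_replicate = sorted(fragment_replicate_join([orders[:2], orders[2:]], users))
```

Both functions produce the same joined tuples for this data. The difference lies in where the work happens: the repartition join moves both inputs across the simulated shuffle, while the fragment-replicate join copies only the small input and never shuffles the large one.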



