IncMR: Incremental Data Processing Based on MapReduce

2012 IEEE Fifth International Conference on Cloud Computing | Pub Date: 2012-06-24 | DOI: 10.1109/CLOUD.2012.67

Cairong Yan, Xin Yang, Ze Yu, Min Li, Xiaolin Li

Citations: 45

Abstract

The MapReduce programming model is widely used for large-scale, one-time, data-intensive distributed computing, but it lacks flexibility and efficiency when processing small increments of data. This paper proposes IncMR, a framework for incrementally processing the new data of a large data set; it takes state as implicit input and combines it with the new data. Map tasks are created for the new splits only rather than for all splits, while reduce tasks fetch their inputs, which include the state and the intermediate results of the new map tasks, from designated nodes or local nodes. Data locality is treated as one of the main optimization means in job scheduling. IncMR is implemented on Hadoop, is compatible with the original MapReduce interfaces, and is transparent to users. Experiments show that non-iterative algorithms running in the MapReduce framework can be migrated to IncMR directly, without any modification, to obtain efficient incremental and continuous processing. IncMR is competitive and, in all studied cases, runs faster than reprocessing the entire data set.
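
The idea of treating state as an implicit input can be illustrated with a small word-count sketch. This is not IncMR's implementation or API: the function names (`map_split`, `reduce_with_state`, `incremental_job`, `full_job`) are hypothetical, everything runs in a single process, and the sketch assumes a reduce function whose previous output can be folded together with new intermediate results, which holds for counting.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def map_split(split: str) -> List[Tuple[str, int]]:
    """Hypothetical map task: emit (word, 1) for each word in one split."""
    return [(word, 1) for word in split.split()]


def reduce_with_state(
    state: Dict[str, int], intermediate: Iterable[Tuple[str, int]]
) -> Dict[str, int]:
    """Hypothetical reduce task: combine saved state with new map output."""
    merged = Counter(state)
    for key, value in intermediate:
        merged[key] += value
    return dict(merged)


def incremental_job(state: Dict[str, int], new_splits: List[str]) -> Dict[str, int]:
    """Run map tasks over the new splits only, then reduce against the state."""
    intermediate: List[Tuple[str, int]] = []
    for split in new_splits:
        intermediate.extend(map_split(split))
    return reduce_with_state(state, intermediate)


def full_job(all_splits: List[str]) -> Dict[str, int]:
    """Baseline: process every split from scratch, starting with empty state."""
    return incremental_job({}, all_splits)


old_splits = ["a b a", "b c"]
new_splits = ["c c d"]

state = full_job(old_splits)                   # result of the earlier run
updated = incremental_job(state, new_splits)   # only the new split is mapped
```

In this toy setting the incremental result matches a full recomputation over `old_splits + new_splits`, while the map work is proportional to the new data alone. Scheduling, data locality, and where the state is stored are outside the scope of the sketch.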



