IncMR: Incremental Data Processing Based on MapReduce

Cairong Yan, Xin Yang, Ze Yu, Min Li, Xiaolin Li
Published in: 2012 IEEE Fifth International Conference on Cloud Computing, 2012-06-24. DOI: 10.1109/CLOUD.2012.67 (https://doi.org/10.1109/CLOUD.2012.67)
Citations: 45

Abstract

The MapReduce programming model is widely used for large-scale, one-time, data-intensive distributed computing, but it lacks flexibility and efficiency when processing small increments of data. This paper proposes IncMR, a framework for incrementally processing the new data of a large data set, which takes state as implicit input and combines it with the new data. Map tasks are created for the new splits only rather than for all splits, while reduce tasks fetch their inputs, including the state and the intermediate results of the new map tasks, from designated nodes or local nodes. Data locality is treated as one of the main optimization means for job scheduling. IncMR is implemented on Hadoop, is compatible with the original MapReduce interfaces, and is transparent to users. Experiments show that non-iterative algorithms running in the MapReduce framework can be migrated to IncMR directly, without any modification, to obtain efficient incremental and continuous processing. IncMR is competitive and, in all studied cases, runs faster than reprocessing the entire data set.
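The idea of taking state as implicit input can be illustrated with a minimal single-process sketch. It is not the authors' Hadoop implementation: the names `map_fn`, `reduce_fn`, `run_full`, and `run_incremental` are hypothetical, word count stands in for a generic non-iterative job, and the state is assumed to be the previous reduce output, which only works when the reduce function can fold earlier results back in (as summation can).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

KV = Tuple[str, int]


def map_fn(split: str) -> Iterable[KV]:
    """Word-count mapper: emit (word, 1) for every word in one input split."""
    for word in split.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: List[int]) -> int:
    """Word-count reducer: sum all counts seen for a key."""
    return sum(values)


def run_full(splits: List[str]) -> Dict[str, int]:
    """Baseline one-time job: map every split and reduce from scratch."""
    grouped = defaultdict(list)
    for split in splits:
        for key, value in map_fn(split):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def run_incremental(state: Dict[str, int], new_splits: List[str]) -> Dict[str, int]:
    """Incremental job: map only the new splits, then reduce state + new output."""
    grouped = defaultdict(list)
    # Assumption: the saved state enters the shuffle as if it were
    # intermediate output, so no old split has to be mapped again.
    for key, value in state.items():
        grouped[key].append(value)
    # Map tasks are created for the new splits only.
    for split in new_splits:
        for key, value in map_fn(split):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


old_splits = ["a b a", "b c"]
new_splits = ["c c d"]

state = run_full(old_splits)                      # result of the earlier job
incremental = run_incremental(state, new_splits)  # touches only new_splits
```

In this sketch the user-supplied `map_fn` and `reduce_fn` are the same in both paths, which mirrors the claim that existing non-iterative jobs need no modification; only the driver decides which splits to map and whether to feed the saved state to the reducers.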