iHadoop: Asynchronous Iterations for MapReduce

2011 IEEE Third International Conference on Cloud Computing Technology and Science
Pub Date: 2011-11-29
DOI: 10.1109/CloudCom.2011.21

Eslam Elnikety, T. Elsayed, Hany E. Ramadan


Citations: 63

Abstract

MapReduce is a distributed programming framework designed to ease the development of scalable data-intensive applications for large clusters of commodity machines. Most machine learning and data mining applications involve iterative computations over large datasets, such as Web hyperlink structures and social network graphs. Yet the MapReduce model does not efficiently support this important class of applications. The architecture of MapReduce, most critically its dataflow techniques and task scheduling, is completely unaware of the nature of iterative applications: tasks are scheduled according to a policy that optimizes the execution of a single iteration, which wastes bandwidth, I/O, and CPU cycles compared with an optimal execution of a consecutive set of iterations. This work presents iHadoop, a modified MapReduce model and an associated implementation, optimized for iterative computations. The iHadoop model schedules iterations asynchronously. It connects the output of one iteration to the input of the next, allowing both to process their data concurrently. iHadoop's task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing a fast local data transfer. For iterative applications that must satisfy certain criteria before terminating, iHadoop runs the termination check concurrently with the execution of the subsequent iteration to further reduce the application's latency. This paper also describes our implementation of the iHadoop model and evaluates its performance against Hadoop, the widely used open-source implementation of MapReduce. Experiments using different data analysis applications over real-world and synthetic datasets show that iHadoop performs better than Hadoop for iterative algorithms, reducing the execution time of iterative applications by 25% on average. Furthermore, integrating iHadoop with HaLoop, a variant Hadoop implementation that caches invariant data between iterations, reduces execution time by 38% on average.
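The idea of overlapping the termination check with the next iteration can be illustrated with a minimal single-process sketch. This is not iHadoop's implementation: the names `step`, `converged`, and `iterate_async` are hypothetical, two Python threads stand in for cluster tasks, and a Newton update for the square root of 2 stands in for an iterative MapReduce job. The sketch models only the speculative start of the next iteration; it does not model streaming data between iterations or locality-aware scheduling.

```python
from concurrent.futures import ThreadPoolExecutor


def step(x):
    """One iteration of a toy job: a Newton update converging to sqrt(2)."""
    return 0.5 * (x + 2.0 / x)


def converged(prev, cur, tol=1e-12):
    """Termination check comparing the outputs of two consecutive iterations."""
    return abs(cur - prev) < tol


def iterate_async(x0, max_iters=100):
    """Run iterations, overlapping each termination check with the next step.

    Returns the final value and the number of iterations whose output was kept.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        prev, cur = x0, step(x0)
        for k in range(1, max_iters):
            # Check whether iteration k already satisfied the criteria ...
            check = pool.submit(converged, prev, cur)
            # ... while iteration k+1 starts speculatively on k's output.
            speculative = pool.submit(step, cur)
            if check.result():
                # Converged: the speculative iteration is wasted work and
                # its result is discarded (cancel is best effort only).
                speculative.cancel()
                return cur, k
            prev, cur = cur, speculative.result()
    return cur, max_iters


if __name__ == "__main__":
    value, iterations = iterate_async(1.0)
    print(value, iterations)
```

In a synchronous loop the check would have to finish before the next iteration could begin. Here the check costs no extra latency on iterations that do not converge, at the price of one discarded speculative iteration at the end.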



