Performance analysis of graph based iterative algorithms on MapReduce framework

International Conference for Convergence for Technology-2014. Pub Date: 2014-04-06. DOI: 10.1109/I2CT.2014.7092125

A. Debbarma, B. Annappa, Ravi G. Mude


Citations: 1

Abstract


Performance analysis of graph based iterative algorithms on MapReduce framework

Over the past few years, the amount of digital data being produced has grown enormously, and numerous attempts are being made to process it quickly and effectively. Hadoop MapReduce is one such software framework that has gained popularity for distributed computation on Big Data. It provides a scalable, economical, and easier way to process massive amounts of data in parallel on large computing clusters while preserving fault tolerance in a transparent manner. However, when running iterative jobs, Hadoop always stores intermediate results on the local disk. As a result, Hadoop usually suffers from long execution times for iterative jobs, as it typically pays a high I/O cost and wastes CPU cycles and network bandwidth. This paper analyses the problems of existing Hadoop and compares its performance against iMapReduce and HaLoop for graph-based iterative algorithms. HaLoop offers better performance because it stores intermediate results in a cache and reuses that data in the next iteration. To use cached invariant data (inter-iteration locality), it schedules tasks that occur in different iterations onto the same node.
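The difference between reloading loop-invariant data on every iteration and caching it once can be illustrated with a small, single-process sketch. This is not the paper's code and does not use the Hadoop, iMapReduce, or HaLoop APIs. PageRank is chosen here only as a familiar graph-based iterative algorithm (the abstract does not name one), and the names `GRAPH`, `DAMPING`, `map_phase`, `reduce_phase`, and `run` are hypothetical. The `loads` counter stands in for the disk reads that a plain iterative job would repeat.

```python
from collections import defaultdict

# Toy directed graph: node -> out-neighbours. This adjacency data never
# changes between iterations, so it is the loop-invariant part of the job.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
DAMPING = 0.85  # assumed damping factor, chosen for illustration


def map_phase(graph, ranks):
    """Emit one (target, contribution) pair per edge."""
    for node, neighbours in graph.items():
        share = ranks[node] / len(neighbours)
        for target in neighbours:
            yield target, share


def reduce_phase(pairs, nodes):
    """Sum the contributions per node and apply the damping factor."""
    totals = defaultdict(float)
    for target, share in pairs:
        totals[target] += share
    n = len(nodes)
    return {node: (1 - DAMPING) / n + DAMPING * totals[node] for node in nodes}


def run(iterations, cache_graph):
    """Run PageRank; return the final ranks and the number of graph loads."""
    loads = 0

    def load_graph():
        nonlocal loads
        loads += 1  # stands in for reading the invariant data from disk
        return dict(GRAPH)

    # With caching, the invariant data is loaded once before the loop.
    cached = load_graph() if cache_graph else None
    ranks = {node: 1 / len(GRAPH) for node in GRAPH}
    for _ in range(iterations):
        # Without caching, every iteration pays the load cost again.
        graph = cached if cache_graph else load_graph()
        ranks = reduce_phase(map_phase(graph, ranks), list(graph))
    return ranks, loads


if __name__ == "__main__":
    for cache in (False, True):
        ranks, loads = run(10, cache_graph=cache)
        print(f"cache_graph={cache}: graph loads={loads}, ranks={ranks}")
```

Both runs compute the same ranks; only the number of times the invariant graph is loaded differs, which is the cost the abstract attributes to running iterative jobs on plain Hadoop.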

