A case for MapReduce over the internet

Hrishikesh Gadre, I. Rodero, J. Montes, M. Parashar
{"title":"A case for MapReduce over the internet","authors":"Hrishikesh Gadre, I. Rodero, J. Montes, M. Parashar","doi":"10.1145/2494621.2494632","DOIUrl":null,"url":null,"abstract":"In recent years, MapReduce programming model and specifically its open source implementation Hadoop has been widely used by organizations to perform large-scale data processing tasks such as web-indexing, data mining as well as scientific simulations. The key benefits of this programming model include its simple programming interface and ability to process massive datasets in a scalable fashion without requiring high-end computing infrastructure. We observe that the current design of Hadoop framework assumes a centralized execution environment involving a single datacenter. This assumption leads to simplified design decisions in the Hadoop architecture regarding efficient network usage, specifically in the replica-selection policy in Hadoop Distributed File System (HDFS) and in the reduce phase scheduling algorithm. In this paper, we investigate real-world scenarios in which MapReduce programming model and specifically Hadoop framework could be used for processing large-scale, geographically scattered datasets. We show that using the Hadoop framework with default policies can cause severe performance degradation in such geographically distributed environment. We propose and evaluate extensions to Hadoop MapReduce framework to improve its performance in such environments. The evaluation demonstrates that the proposed extensions substantially outperform default policies in the Hadoop framework.","PeriodicalId":190559,"journal":{"name":"ACM Cloud and Autonomic Computing Conference","volume":"197 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Cloud and Autonomic Computing Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2494621.2494632","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

In recent years, MapReduce programming model and specifically its open source implementation Hadoop has been widely used by organizations to perform large-scale data processing tasks such as web-indexing, data mining as well as scientific simulations. The key benefits of this programming model include its simple programming interface and ability to process massive datasets in a scalable fashion without requiring high-end computing infrastructure. We observe that the current design of Hadoop framework assumes a centralized execution environment involving a single datacenter. This assumption leads to simplified design decisions in the Hadoop architecture regarding efficient network usage, specifically in the replica-selection policy in Hadoop Distributed File System (HDFS) and in the reduce phase scheduling algorithm. In this paper, we investigate real-world scenarios in which MapReduce programming model and specifically Hadoop framework could be used for processing large-scale, geographically scattered datasets. We show that using the Hadoop framework with default policies can cause severe performance degradation in such geographically distributed environment. We propose and evaluate extensions to Hadoop MapReduce framework to improve its performance in such environments. The evaluation demonstrates that the proposed extensions substantially outperform default policies in the Hadoop framework.
MapReduce在互联网上的一个案例
近年来,MapReduce编程模型及其开源实现Hadoop已被组织广泛用于执行大规模数据处理任务,如web索引,数据挖掘以及科学模拟。这种编程模型的主要优点包括其简单的编程接口和以可扩展的方式处理大量数据集的能力,而不需要高端的计算基础设施。我们观察到Hadoop框架的当前设计假设了一个涉及单个数据中心的集中执行环境。这一假设简化了Hadoop架构中有关高效网络使用的设计决策,特别是在Hadoop分布式文件系统(HDFS)中的副本选择策略和reduce阶段调度算法中。在本文中,我们研究了MapReduce编程模型和Hadoop框架可用于处理大规模地理分散数据集的现实场景。我们表明,在这种地理分布的环境中,使用带有默认策略的Hadoop框架可能会导致严重的性能下降。我们提出并评估了Hadoop MapReduce框架的扩展,以提高其在此类环境中的性能。评估表明,提议的扩展在Hadoop框架中的性能大大优于默认策略。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信