A case for MapReduce over the internet

ACM Cloud and Autonomic Computing Conference. Pub Date: 2013-08-09. DOI: 10.1145/2494621.2494632

Hrishikesh Gadre, I. Rodero, J. Montes, M. Parashar


Citations: 2

Abstract

In recent years, the MapReduce programming model, and specifically its open-source implementation Hadoop, has been widely used by organizations to perform large-scale data processing tasks such as web indexing, data mining, and scientific simulations. The key benefits of this programming model include its simple programming interface and its ability to process massive datasets in a scalable fashion without requiring high-end computing infrastructure. We observe that the current design of the Hadoop framework assumes a centralized execution environment involving a single datacenter. This assumption leads to simplified design decisions in the Hadoop architecture regarding efficient network usage, specifically in the replica-selection policy in the Hadoop Distributed File System (HDFS) and in the reduce-phase scheduling algorithm. In this paper, we investigate real-world scenarios in which the MapReduce programming model, and specifically the Hadoop framework, could be used for processing large-scale, geographically scattered datasets. We show that using the Hadoop framework with default policies can cause severe performance degradation in such geographically distributed environments. We propose and evaluate extensions to the Hadoop MapReduce framework to improve its performance in such environments. The evaluation demonstrates that the proposed extensions substantially outperform the default policies in the Hadoop framework.
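The abstract does not spell out the proposed extensions, so the following is only an illustrative sketch of why a single-datacenter assumption can mislead replica selection. It is not the authors' method. It assumes that replica choice is driven by hop distance in a topology tree, in the spirit of Hadoop's rack-awareness paths. The helper names `tree_distance` and `pick_replica`, and all node paths, are hypothetical.

```python
from typing import Sequence


def tree_distance(a: str, b: str) -> int:
    """Hop count between two leaves of a topology tree.

    Paths look like '/rack1/n1' (flat) or '/dcA/rack1/n1' (datacenter-aware).
    The distance is the number of edges from each leaf up to their closest
    common ancestor, added together.
    """
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def pick_replica(reader: str, replicas: Sequence[str]) -> str:
    """Return the replica with the smallest tree distance to the reader.

    Ties go to the first-listed replica (an assumption of this sketch).
    """
    return min(replicas, key=lambda r: tree_distance(reader, r))


# Flat topology: the tree has no datacenter level, so a replica in a remote
# datacenter looks exactly as far away as one in the local datacenter.
flat_choice = pick_replica("/rack1/n1", ["/rack7/n3", "/rack2/n5"])

# Datacenter-aware topology: the extra level makes the remote replica
# strictly farther, so the local-datacenter replica is chosen.
aware_choice = pick_replica("/dcA/rack1/n1", ["/dcB/rack7/n3", "/dcA/rack2/n5"])

print(flat_choice)   # /rack7/n3
print(aware_choice)  # /dcA/rack2/n5
```

In the flat case both replicas are at distance 4, so the choice between a local and a wide-area read is effectively arbitrary. Adding a datacenter level raises the remote replica's distance to 6, which lets the same nearest-first rule avoid the wide-area link.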

