A Review on Data locality in Hadoop MapReduce

Anil Sharma, Gurwinder Singh
Published in: 2018 Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC), 2018-12-01
DOI: 10.1109/PDGC.2018.8745928 (https://doi.org/10.1109/PDGC.2018.8745928)
Citation count: 4

Abstract

MapReduce has emerged as a strong model for parallel and distributed processing of huge datasets. Hadoop, an open-source implementation of MapReduce, has driven its wide adoption. Hadoop splits an input file into a number of data blocks and allocates them to various DataNodes in the cluster, so it must schedule tasks effectively to process these blocks efficiently. One of the issues that plays a vital role in efficient MapReduce processing is data locality, which matters because of network overhead. Data locality means moving the computation close to where the data resides. It is a key factor in a distributed environment because it influences task completion time. Factors that affect data locality include cluster and network load, resource sharing, the cluster environment, the size of data blocks, and the number of mappers and reducers. This paper reviews various scheduling algorithms that are aware of data locality, along with their strengths and weaknesses.
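
To make the idea of moving computation to the data concrete, the following is a minimal sketch of locality-aware task selection. It is not Hadoop's scheduler and not any algorithm reviewed in the paper; the cluster layout, block placement, and the names `NODE_RACK`, `BLOCK_REPLICAS`, `locality_level`, and `pick_task` are all hypothetical and chosen only for illustration. It assumes three locality levels: node-local, rack-local, and off-rack.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical cluster layout: node name -> rack name.
NODE_RACK: Dict[str, str] = {
    "node1": "rackA",
    "node2": "rackA",
    "node3": "rackB",
}

# Hypothetical block placement: block id -> DataNodes holding a replica.
BLOCK_REPLICAS: Dict[str, List[str]] = {
    "blk_1": ["node1", "node3"],
    "blk_2": ["node2"],
    "blk_3": ["node3"],
}


def locality_level(block: str, node: str) -> int:
    """Return 0 for node-local, 1 for rack-local, 2 for off-rack."""
    replicas = BLOCK_REPLICAS[block]
    if node in replicas:
        return 0
    if any(NODE_RACK[r] == NODE_RACK[node] for r in replicas):
        return 1
    return 2


def pick_task(pending: List[str], free_node: str) -> Optional[Tuple[str, int]]:
    """Pick the pending block with the best locality for free_node.

    Returns (block id, locality level), or None if nothing is pending.
    Ties go to the block that appears first in the pending list.
    """
    if not pending:
        return None
    best = min(pending, key=lambda b: locality_level(b, free_node))
    return best, locality_level(best, free_node)


if __name__ == "__main__":
    # node2 holds a replica of blk_2, so that task runs without network transfer.
    print(pick_task(["blk_1", "blk_2", "blk_3"], "node2"))
```

A greedy choice like this ignores the other factors the abstract lists, such as cluster load and resource sharing, which is where locality-aware schedulers have to make trade-offs.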