A Review on Data locality in Hadoop MapReduce

2018 Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC) Pub Date : 2018-12-01 DOI:10.1109/PDGC.2018.8745928

Anil Sharma, Gurwinder Singh

Citations: 4

Abstract

MapReduce has emerged as a strong model for parallel and distributed processing of huge datasets. Hadoop, an open-source implementation of MapReduce, has helped make MapReduce widely adopted. Hadoop splits the input file into a number of data blocks and allocates them to various DataNodes in the cluster. Hadoop must provide effective scheduling to process these data blocks efficiently. One of the issues that plays a vital role in efficient MapReduce processing is data locality, which matters because of network overhead. Data locality means moving the computation close to the data where it resides. It is a key factor in a distributed environment and influences task completion time. The issues that affect data locality are cluster and network load, resource sharing, cluster environment, size of data blocks, and the number of mappers and reducers. This paper reviews various scheduling algorithms that are aware of data locality, along with their strengths and weaknesses.
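To make the idea of locality-aware scheduling concrete, the following is a minimal sketch, not an algorithm taken from the paper. It assumes a toy cluster in which each pending map task reads one data block with known replica locations, and it prefers a task whose block is on the free node, then one on the same rack, then any other. All names (`locality_level`, `pick_task`, the node and rack labels) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Locality levels, lower is better (illustrative constants, not Hadoop's API).
NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2


def locality_level(node: str, replicas: Set[str], rack_of: Dict[str, str]) -> int:
    """Classify how close `node` is to any replica of a task's input block."""
    if node in replicas:
        return NODE_LOCAL
    if any(rack_of[r] == rack_of[node] for r in replicas):
        return RACK_LOCAL
    return OFF_RACK


def pick_task(
    node: str,
    pending: List[str],
    replicas_of: Dict[str, Set[str]],
    rack_of: Dict[str, str],
) -> Optional[Tuple[str, int]]:
    """Return (task, level) for the pending task closest to `node`, or None."""
    if not pending:
        return None
    best = min(pending, key=lambda t: locality_level(node, replicas_of[t], rack_of))
    return best, locality_level(node, replicas_of[best], rack_of)


# Hypothetical toy cluster: two racks, three nodes, one replica per block.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
replicas_of = {"t1": {"n3"}, "t2": {"n2"}, "t3": {"n1"}}
pending = ["t1", "t2", "t3"]

if __name__ == "__main__":
    # Node n1 holds the block for t3, so t3 is the node-local choice.
    print(pick_task("n1", pending, replicas_of, rack_of))
```

A greedy choice like this ignores the other factors the abstract lists, such as cluster load and resource sharing; it only shows why a scheduler that knows block placement can avoid moving data over the network.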
