Impact of Small Files on Hadoop Performance: Literature Survey and Open Points

T. El-Sayed, M. Badawy, A. El-Sayed
Menoufia Journal of Electronic Engineering Research, 2019-01-01. DOI: 10.21608/mjeer.2019.62728
Citations: 4

Abstract

Hadoop is an open-source framework written in Java and used for big-data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS stores data, while MapReduce distributes and processes an application's tasks in a distributed fashion. Recently, several researchers have employed Hadoop for processing big data. The results indicate that Hadoop performs well with large files (files larger than the DataNode block size). Nevertheless, Hadoop performance decreases with small files, i.e. files smaller than the block size. This is because small files consume the memory of both the DataNode and the NameNode and increase the execution time of applications (i.e. they decrease MapReduce performance). In this paper, the small-file problem in Hadoop is defined, and the existing approaches to solving it are classified and discussed. In addition, some open points are presented that must be considered when designing a better approach to improve Hadoop performance when processing small files.
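The NameNode-memory argument above can be illustrated with a rough back-of-envelope sketch. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per HDFS object (file inode or block); the exact figure varies by Hadoop version, so this is an estimate, not an implementation of any surveyed approach:

```python
import math

# Rule-of-thumb NameNode heap cost per HDFS object (inode or block).
# This ~150-byte figure is a widely quoted approximation, not exact.
BYTES_PER_OBJECT = 150

def namenode_metadata_bytes(num_files, block_size, avg_file_size):
    """Approximate NameNode heap consumed by num_files files."""
    blocks_per_file = max(1, math.ceil(avg_file_size / block_size))
    objects = num_files * (1 + blocks_per_file)  # one inode plus its blocks
    return objects * BYTES_PER_OBJECT

BLOCK = 128 * 1024 * 1024  # default HDFS block size: 128 MB

# Ten million 10 KB files vs. the same total data packed into 128 MB files:
small = namenode_metadata_bytes(10_000_000, BLOCK, 10 * 1024)
merged_count = (10_000_000 * 10 * 1024) // BLOCK + 1
large = namenode_metadata_bytes(merged_count, BLOCK, BLOCK)

print(f"small files:  ~{small / 2**20:.1f} MiB of NameNode heap")
print(f"merged files: ~{large / 2**20:.1f} MiB of NameNode heap")
```

Even with identical data volume, the small-file layout needs thousands of times more NameNode memory, which is why the surveyed approaches (archiving, merging, combined input formats) all aim to reduce the object count rather than the data size.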