Impact of Small Files on Hadoop Performance: Literature Survey and Open Points

T. El-Sayed, M. Badawy, A. El-Sayed
Menoufia Journal of Electronic Engineering Research, 2019-01-01. DOI: 10.21608/mjeer.2019.62728
Citations: 4

Abstract

Hadoop is an open-source framework written in Java and used for big-data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS stores data, while MapReduce distributes and processes an application's tasks in a distributed fashion. Recently, several researchers have employed Hadoop for processing big data. The results indicate that Hadoop performs well with large files (files larger than the DataNode block size). Nevertheless, Hadoop performance decreases with small files, i.e. files smaller than the block size. This is because small files consume the memory of both the DataNode and the NameNode and increase the execution time of applications (i.e. they decrease MapReduce performance). In this paper, the small-file problem in Hadoop is defined, and the existing approaches to solving it are classified and discussed. In addition, some open points are presented that must be considered when designing a better approach to improve Hadoop performance when processing small files.
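The NameNode-memory argument above can be illustrated with a rough back-of-envelope sketch. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per HDFS object (file inode or block); the exact figure varies by Hadoop version, so this is an estimate, not an implementation of any surveyed approach:

```python
import math

# Rule-of-thumb NameNode heap cost per HDFS object (inode or block).
# This ~150-byte figure is a widely quoted approximation, not exact.
BYTES_PER_OBJECT = 150

def namenode_metadata_bytes(num_files, block_size, avg_file_size):
    """Approximate NameNode heap consumed by num_files files."""
    blocks_per_file = max(1, math.ceil(avg_file_size / block_size))
    objects = num_files * (1 + blocks_per_file)  # one inode plus its blocks
    return objects * BYTES_PER_OBJECT

BLOCK = 128 * 1024 * 1024  # default HDFS block size: 128 MB

# Ten million 10 KB files vs. the same total data packed into 128 MB files:
small = namenode_metadata_bytes(10_000_000, BLOCK, 10 * 1024)
merged_count = (10_000_000 * 10 * 1024) // BLOCK + 1
large = namenode_metadata_bytes(merged_count, BLOCK, BLOCK)

print(f"small files:  ~{small / 2**20:.1f} MiB of NameNode heap")
print(f"merged files: ~{large / 2**20:.1f} MiB of NameNode heap")
```

Even with identical data volume, the small-file layout needs thousands of times more NameNode memory, which is why the surveyed approaches (archiving, merging, combined input formats) all aim to reduce the object count rather than the data size.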