Addressing Hadoop's Small File Problem With an Appendable Archive File Format

T. Renner, Johannes Müller, L. Thamsen, O. Kao
{"title":"Addressing Hadoop's Small File Problem With an Appendable Archive File Format","authors":"T. Renner, Johannes Müller, L. Thamsen, O. Kao","doi":"10.1145/3075564.3078888","DOIUrl":null,"url":null,"abstract":"Hadoop has been used widely for data analytic tasks in various domains. At the same time, data volume is expected to grow even further in the next years. Hadoop recently introduced the concept Archival Storage, an automated tiered storage technique for increasing storage capacity for long-term storage. However, Hadoop Distributed File System's scalability is limited by the total number of files that can be stored, and it is likely that the number of files increases fast when using it for archival purposes. This paper presents an approach for improving HDFS' scalability when using it as an archival storage. We present a tool that extends Hadoop Archive to an appendable file format. New files are appended to one of the existing archive data files efficiently without rewriting the whole archive. Therefore, a first fit algorithm is used to fill up the often not fully utilized fixed-sized data blocks of the archive data files. Index files are updated using a red-black tree providing guaranteed fast lookup and insert performance. We show that the tool performs well for different sizes of archives and number of files to add. By distributing new files efficiently, we also reduce the number of data blocks needed for archiving and, thus, reduce the memory footprint on the NameNode.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Computing Frontiers Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3075564.3078888","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Hadoop has been used widely for data analytic tasks in various domains. At the same time, data volume is expected to grow even further in the next years. Hadoop recently introduced the concept Archival Storage, an automated tiered storage technique for increasing storage capacity for long-term storage. However, Hadoop Distributed File System's scalability is limited by the total number of files that can be stored, and it is likely that the number of files increases fast when using it for archival purposes. This paper presents an approach for improving HDFS' scalability when using it as an archival storage. We present a tool that extends Hadoop Archive to an appendable file format. New files are appended to one of the existing archive data files efficiently without rewriting the whole archive. Therefore, a first fit algorithm is used to fill up the often not fully utilized fixed-sized data blocks of the archive data files. Index files are updated using a red-black tree providing guaranteed fast lookup and insert performance. We show that the tool performs well for different sizes of archives and number of files to add. By distributing new files efficiently, we also reduce the number of data blocks needed for archiving and, thus, reduce the memory footprint on the NameNode.
用可附加的存档文件格式解决Hadoop的小文件问题
Hadoop已被广泛用于各个领域的数据分析任务。与此同时,预计未来几年数据量将进一步增长。Hadoop最近引入了归档存储的概念,这是一种自动分层存储技术,用于增加长期存储的存储容量。然而,Hadoop分布式文件系统的可伸缩性受到可以存储的文件总数的限制,并且当将其用于存档目的时,文件数量可能会快速增加。本文提出了一种提高HDFS作为归档存储时的可扩展性的方法。我们提供了一个将Hadoop Archive扩展为可附加文件格式的工具。新文件被有效地附加到现有的存档数据文件之一,而无需重写整个存档。因此,采用首次拟合算法来填充归档数据文件中通常未被充分利用的固定大小的数据块。索引文件使用红黑树进行更新,从而保证快速查找和插入性能。我们展示了该工具对于不同大小的归档和要添加的文件数量表现良好。通过有效地分发新文件,我们还减少了归档所需的数据块数量,从而减少了NameNode上的内存占用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信