Small Files Problem Resolution via Hierarchical Clustering Algorithm.

IF 4.6 Q2 MATERIALS SCIENCE, BIOMATERIALS
ACS Applied Bio Materials Pub Date : 2024-01-01 Epub Date: 2023-05-16 DOI:10.1089/big.2022.0181
Oded Koren, Aviel Shamalov, Nir Perel
Citations: 0

Abstract

The Small Files Problem in the Hadoop Distributed File System (HDFS) is an ongoing challenge that has not yet been solved, although various approaches have been developed to tackle the obstacles it creates. Properly managing block sizes in a file system is essential because it saves memory and computing time and may reduce bottlenecks. This article proposes a new approach to handling small files that uses a hierarchical clustering algorithm. The proposed method identifies files by their structure, analyzes them with a dedicated dendrogram, and then recommends which files can be merged. As a simulation, the algorithm was applied to 100 CSV files with different structures, each containing 2-4 columns of different data types (integer, decimal, and text). In addition, 20 non-CSV files were created to demonstrate that the algorithm works only on CSV files. All data were analyzed with a machine learning hierarchical clustering method, and a dendrogram was created. Based on the merge process that was performed, seven files from the dendrogram analysis were chosen as appropriate to merge, which reduced the memory space used in HDFS. The results also showed that the suggested algorithm led to efficient file management.
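
The idea in the abstract (group CSV files by structure, build a dendrogram, recommend merges) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not state the features, distance measure, linkage rule, or cut threshold, so `structure_signature`, `distance`, single linkage, and `max_distance` are all assumptions, and the sample files are invented for the demonstration. It uses only the Python standard library.

```python
import csv
import io
from itertools import combinations


def infer_type(value):
    """Classify one cell as 'integer', 'decimal', or 'text'."""
    for caster, label in ((int, "integer"), (float, "decimal")):
        try:
            caster(value)
            return label
        except ValueError:
            continue
    return "text"


def structure_signature(csv_text):
    """Column types of a CSV, inferred from its first data row.
    Assumes a header row followed by at least one data row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return tuple(infer_type(cell) for cell in rows[1])


def distance(sig_a, sig_b):
    """Hypothetical structural distance: type mismatches in shared
    columns plus the difference in column count."""
    mismatches = sum(a != b for a, b in zip(sig_a, sig_b))
    return mismatches + abs(len(sig_a) - len(sig_b))


def agglomerate(signatures):
    """Single-linkage agglomerative clustering (an assumed linkage rule).
    Returns the merge history as (left, right, distance) tuples, which is
    the information a dendrogram plots."""
    def linkage(c1, c2):
        return min(distance(signatures[a], signatures[b])
                   for a in c1 for b in c2)

    clusters = [frozenset([name]) for name in sorted(signatures)]
    history = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2),
                          key=lambda pair: linkage(*pair))
        history.append((left, right, linkage(left, right)))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left | right)
    return history


def merge_candidates(history, max_distance=0):
    """Cut the dendrogram: keep the largest groups joined at or below
    max_distance. Files left alone are not merge candidates."""
    groups = []
    for left, right, dist in history:
        if dist > max_distance:
            continue
        merged = left | right
        groups = [g for g in groups if not (g <= merged)]
        groups.append(merged)
    return sorted(sorted(g) for g in groups)


# Invented sample data; the non-CSV file is skipped, not clustered.
files = {
    "a.csv": "id,price\n1,2.5\n",
    "b.csv": "id,price\n7,9.75\n",
    "c.csv": "name,city,age\nAnn,Oslo,31\n",
    "d.csv": "name,city,age\nBob,Rome,45\n",
    "e.csv": "id,name\n3,Cy\n",
    "notes.txt": "free text, not tabular",
}
signatures = {name: structure_signature(text)
              for name, text in files.items() if name.endswith(".csv")}
history = agglomerate(signatures)
candidates = merge_candidates(history, max_distance=0)
print(candidates)  # [['a.csv', 'b.csv'], ['c.csv', 'd.csv']]
```

With `max_distance=0` only files with identical column types are grouped; raising the threshold cuts the dendrogram higher and groups files whose structures merely resemble each other, which would need some schema reconciliation before an actual merge.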

Source journal
ACS Applied Bio Materials (Chemistry: Chemistry (all))
CiteScore: 9.40
Self-citation rate: 2.10%
Articles published: 464