Deduplication using Modified Dynamic File Chunking for Big Data Mining

Saja Taha Ahmed
International Journal of Computing and Digital Systems · DOI: 10.12785/ijcds/160105 · Published 2024-07-01

Abstract

The unpredictable growth of data requires data management that makes optimum use of storage capacity. This study proposes an innovative strategy for data deduplication. In predefined-size deduplication, the file is split into blocks of a fixed size. The main drawback of this approach is that if new content is inserted at the beginning or middle of a file, all subsequent blocks shift from their original positions; the regenerated chunks therefore receive new hash values, which lowers the deduplication ratio. To overcome this drawback, this study proposes using multiple characters as content-defined chunking breakpoints, which depend chiefly on the file's internal representation and yield variable chunk sizes. Experimental results show a significant improvement in the redundancy removal ratio on the Linux dataset. A comparison of the proposed fixed and dynamic deduplication shows that dynamic chunking produces a smaller average chunk size and achieves a much higher deduplication ratio.
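
To make the boundary-shift problem concrete, the following Python sketch (not taken from the paper) contrasts fixed-size chunking with a simple character-breakpoint, content-defined chunker. The 8-byte block size, the breakpoint characters, and the sample strings are illustrative assumptions only; the paper derives its breakpoints from the file's internal representation.

```python
# Minimal sketch: fixed-size vs. character-breakpoint content-defined chunking.
# All parameters and sample data below are illustrative assumptions, not the
# paper's actual configuration.
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    """Split data into fixed-size blocks (predefined-size chunking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, breakpoints: bytes = b" .\n"):
    """End a chunk after any breakpoint character, so chunk boundaries follow
    the file's content rather than absolute byte offsets."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte in breakpoints:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def unique_hashes(chunks):
    """Hash each chunk; a deduplicating store keeps one copy per hash."""
    return {hashlib.sha1(c).hexdigest() for c in chunks}

if __name__ == "__main__":
    original = b"alpha beta gamma delta epsilon zeta eta theta"
    edited = b"NEW " + original  # new content inserted at the front of the file

    for name, chunker in (("fixed-size", fixed_chunks),
                          ("content-defined", content_defined_chunks)):
        stored = unique_hashes(chunker(original))
        needed = unique_hashes(chunker(edited))
        reused = len(stored & needed)
        print(f"{name:16s}: {reused}/{len(needed)} chunks of the edited file already stored")
```

In this sketch, inserting bytes at the front of the file shifts every fixed-size block boundary, so none of the previously stored chunk hashes are reused, whereas the character-breakpoint chunker reuses every chunk except the newly created one, which is the effect the proposed dynamic chunking exploits to raise the deduplication ratio.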