{"title":"利用修改后的动态文件分块技术进行重复数据删除以实现大数据挖掘","authors":"Saja Taha Ahmed","doi":"10.12785/ijcds/160105","DOIUrl":null,"url":null,"abstract":": The unpredictability of data growth necessitates data management to make optimum use of storage capacity. An innovative strategy for data deduplication is suggested in this study. The file is split into blocks of a predefined size by the predefined-size DeDuplication algorithm. The primary problem with this strategy is that the preceding sections will be relocated from their original placements if additional sections are inserted into the forefront or center of a file. As a result, the generated chunks will have a new hash value, resulting in a lower DeDuplication ratio. To overcome this drawback, this study suggests multiple characters as content-defined chunking breakpoints, which mostly depend on file internal representation and have variable chunk sizes. The experimental result shows significant improvement in the redundancy removal ratio of the Linux dataset. So, a comparison is made between the proposed fixed and dynamic deduplication stating that dynamic chunking has less average chunk size and can gain a much higher deduplication ratio.","PeriodicalId":37180,"journal":{"name":"International Journal of Computing and Digital Systems","volume":"91 2","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deduplication using Modified Dynamic File Chunking for Big Data Mining\",\"authors\":\"Saja Taha Ahmed\",\"doi\":\"10.12785/ijcds/160105\",\"DOIUrl\":null,\"url\":null,\"abstract\":\": The unpredictability of data growth necessitates data management to make optimum use of storage capacity. An innovative strategy for data deduplication is suggested in this study. The file is split into blocks of a predefined size by the predefined-size DeDuplication algorithm. The primary problem with this strategy is that the preceding sections will be relocated from their original placements if additional sections are inserted into the forefront or center of a file. As a result, the generated chunks will have a new hash value, resulting in a lower DeDuplication ratio. To overcome this drawback, this study suggests multiple characters as content-defined chunking breakpoints, which mostly depend on file internal representation and have variable chunk sizes. The experimental result shows significant improvement in the redundancy removal ratio of the Linux dataset. 
So, a comparison is made between the proposed fixed and dynamic deduplication stating that dynamic chunking has less average chunk size and can gain a much higher deduplication ratio.\",\"PeriodicalId\":37180,\"journal\":{\"name\":\"International Journal of Computing and Digital Systems\",\"volume\":\"91 2\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Computing and Digital Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.12785/ijcds/160105\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computing and Digital Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12785/ijcds/160105","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Deduplication using Modified Dynamic File Chunking for Big Data Mining
Abstract: The unpredictable growth of data requires data management that makes optimal use of storage capacity. This study proposes a novel strategy for data deduplication. A predefined-size deduplication algorithm splits a file into blocks of a fixed size. The primary problem with this approach is that if new content is inserted at the beginning or middle of a file, all subsequent blocks shift from their original positions; the regenerated chunks therefore receive new hash values, lowering the deduplication ratio. To overcome this drawback, this study proposes using multiple characters as content-defined chunking breakpoints, so that chunk boundaries depend mostly on the file's internal content and chunk sizes vary. Experimental results show a significant improvement in the redundancy-removal ratio on a Linux dataset. A comparison between the proposed fixed and dynamic deduplication shows that dynamic chunking yields a smaller average chunk size and a considerably higher deduplication ratio.
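To make the contrast concrete, the sketch below chunks the same input both ways and compares the resulting deduplication ratios. It is a minimal illustration, not the paper's implementation: the 8-byte block size, the breakpoint character set, and the ratio definition (total chunks over unique chunks) are assumptions chosen for the demo.

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    """Split data into fixed-size blocks (predefined-size chunking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, breakpoints: bytes = b" \n.,"):
    """Split data after any breakpoint character (content-defined chunking).

    The breakpoint set here is illustrative; the paper selects its own
    characters based on the file's internal representation.
    """
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte in breakpoints:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedup_ratio(chunks):
    """Total chunks over unique chunks (higher = more redundancy removed)."""
    unique = {hashlib.sha256(c).hexdigest() for c in chunks}
    return len(chunks) / len(unique)

original = b"the quick brown fox. the quick brown fox."
shifted = b"X " + original  # insert two bytes at the front of the file

# Fixed-size boundaries shift after the insertion, so previously seen
# blocks hash to new values.
fixed = fixed_chunks(original) + fixed_chunks(shifted)
# Content-defined boundaries realign at the first breakpoint after the
# insertion, so the repeated chunks hash identically.
dynamic = content_defined_chunks(original) + content_defined_chunks(shifted)

print("fixed-size dedup ratio:     ", round(dedup_ratio(fixed), 2))
print("content-defined dedup ratio:", round(dedup_ratio(dynamic), 2))
```

On this toy input the fixed-size ratio stays at 1.0, because the two-byte insertion shifts every subsequent block boundary, while the content-defined ratio exceeds 3, mirroring the boundary-shift argument in the abstract.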