{"title":"利用修改后的动态文件分块技术进行重复数据删除以实现大数据挖掘","authors":"Saja Taha Ahmed","doi":"10.12785/ijcds/160105","DOIUrl":null,"url":null,"abstract":": The unpredictability of data growth necessitates data management to make optimum use of storage capacity. An innovative strategy for data deduplication is suggested in this study. The file is split into blocks of a predefined size by the predefined-size DeDuplication algorithm. The primary problem with this strategy is that the preceding sections will be relocated from their original placements if additional sections are inserted into the forefront or center of a file. As a result, the generated chunks will have a new hash value, resulting in a lower DeDuplication ratio. To overcome this drawback, this study suggests multiple characters as content-defined chunking breakpoints, which mostly depend on file internal representation and have variable chunk sizes. The experimental result shows significant improvement in the redundancy removal ratio of the Linux dataset. So, a comparison is made between the proposed fixed and dynamic deduplication stating that dynamic chunking has less average chunk size and can gain a much higher deduplication ratio.","PeriodicalId":37180,"journal":{"name":"International Journal of Computing and Digital Systems","volume":"91 2","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deduplication using Modified Dynamic File Chunking for Big Data Mining\",\"authors\":\"Saja Taha Ahmed\",\"doi\":\"10.12785/ijcds/160105\",\"DOIUrl\":null,\"url\":null,\"abstract\":\": The unpredictability of data growth necessitates data management to make optimum use of storage capacity. An innovative strategy for data deduplication is suggested in this study. The file is split into blocks of a predefined size by the predefined-size DeDuplication algorithm. The primary problem with this strategy is that the preceding sections will be relocated from their original placements if additional sections are inserted into the forefront or center of a file. As a result, the generated chunks will have a new hash value, resulting in a lower DeDuplication ratio. To overcome this drawback, this study suggests multiple characters as content-defined chunking breakpoints, which mostly depend on file internal representation and have variable chunk sizes. The experimental result shows significant improvement in the redundancy removal ratio of the Linux dataset. 
So, a comparison is made between the proposed fixed and dynamic deduplication stating that dynamic chunking has less average chunk size and can gain a much higher deduplication ratio.\",\"PeriodicalId\":37180,\"journal\":{\"name\":\"International Journal of Computing and Digital Systems\",\"volume\":\"91 2\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Computing and Digital Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.12785/ijcds/160105\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computing and Digital Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12785/ijcds/160105","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Deduplication using Modified Dynamic File Chunking for Big Data Mining
Abstract: The unpredictable growth of data requires data management that makes optimal use of storage capacity. This study proposes a novel strategy for data deduplication. A predefined-size deduplication algorithm splits a file into blocks of a fixed size. The primary problem with this approach is that if new content is inserted at the beginning or middle of a file, all subsequent blocks shift from their original positions; the regenerated chunks therefore receive new hash values, lowering the deduplication ratio. To overcome this drawback, this study proposes using multiple characters as content-defined chunking breakpoints, so that chunk boundaries depend mostly on the file's internal content and chunk sizes vary. Experimental results show a significant improvement in the redundancy-removal ratio on a Linux dataset. A comparison between the proposed fixed and dynamic deduplication shows that dynamic chunking yields a smaller average chunk size and a considerably higher deduplication ratio.
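To make the contrast concrete, the sketch below chunks the same input both ways and compares the resulting deduplication ratios. It is a minimal illustration, not the paper's implementation: the 8-byte block size, the breakpoint character set, and the ratio definition (total chunks over unique chunks) are assumptions chosen for the demo.

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    """Split data into fixed-size blocks (predefined-size chunking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, breakpoints: bytes = b" \n.,"):
    """Split data after any breakpoint character (content-defined chunking).

    The breakpoint set here is illustrative; the paper selects its own
    characters based on the file's internal representation.
    """
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte in breakpoints:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedup_ratio(chunks):
    """Total chunks over unique chunks (higher = more redundancy removed)."""
    unique = {hashlib.sha256(c).hexdigest() for c in chunks}
    return len(chunks) / len(unique)

original = b"the quick brown fox. the quick brown fox."
shifted = b"X " + original  # insert two bytes at the front of the file

# Fixed-size boundaries shift after the insertion, so previously seen
# blocks hash to new values.
fixed = fixed_chunks(original) + fixed_chunks(shifted)
# Content-defined boundaries realign at the first breakpoint after the
# insertion, so the repeated chunks hash identically.
dynamic = content_defined_chunks(original) + content_defined_chunks(shifted)

print("fixed-size dedup ratio:     ", round(dedup_ratio(fixed), 2))
print("content-defined dedup ratio:", round(dedup_ratio(dynamic), 2))
```

On this toy input the fixed-size ratio stays at 1.0, because the two-byte insertion shifts every subsequent block boundary, while the content-defined ratio exceeds 3, mirroring the boundary-shift argument in the abstract.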