A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse
{"title":"A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication","authors":"Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse","doi":"arxiv-2409.06066","DOIUrl":null,"url":null,"abstract":"Data deduplication emerged as a powerful solution for reducing storage and\nbandwidth costs by eliminating redundancies at the level of chunks. This has\nspurred the development of numerous Content-Defined Chunking (CDC) algorithms\nover the past two decades. Despite advancements, the current state-of-the-art\nremains obscure, as a thorough and impartial analysis and comparison is\nlacking. We conduct a rigorous theoretical analysis and impartial experimental\ncomparison of several leading CDC algorithms. Using four realistic datasets, we\nevaluate these algorithms against four key metrics: throughput, deduplication\nratio, average chunk size, and chunk-size variance. Our analyses, in many\ninstances, extend the findings of their original publications by reporting new\nresults and putting existing ones into context. Moreover, we highlight\nlimitations that have previously gone unnoticed. Our findings provide valuable\ninsights that inform the selection and optimization of CDC algorithms for\npractical applications in data deduplication.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Data deduplication has emerged as a powerful solution for reducing storage and bandwidth costs by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite advancements, the current state of the art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of the original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
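To make the abstract's terminology concrete, the sketch below shows a minimal, illustrative Gear-style content-defined chunker together with three of the four metrics named above (deduplication ratio, average chunk size, and chunk-size variance; throughput would be measured separately by timing the chunking pass). The gear table, mask, and size bounds are assumptions chosen for demonstration only and do not correspond to the paper's configuration or to any specific algorithm it evaluates.

# Illustrative sketch only, not the paper's method: a Gear-style rolling-hash
# chunker plus three of the evaluation metrics named in the abstract.
import hashlib
import random
import statistics

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one random 64-bit value per byte value (assumed table)
MASK = (1 << 13) - 1          # cut when the low 13 bits are zero -> ~8 KiB expected chunk size (assumed)
MIN_SIZE, MAX_SIZE = 2048, 65536  # hard bounds on chunk size (assumed)

def gear_chunks(data: bytes):
    """Yield content-defined chunks using a Gear rolling hash."""
    start, fp = 0, 0
    for i, byte in enumerate(data):
        fp = ((fp << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (fp & MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start, fp = i + 1, 0
    if start < len(data):
        yield data[start:]        # trailing chunk

def dedup_metrics(data: bytes):
    """Deduplication ratio, average chunk size, and chunk-size variance over one dataset."""
    sizes, unique = [], {}
    for chunk in gear_chunks(data):
        sizes.append(len(chunk))
        # deduplicate by content hash: identical chunks are stored once
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    stored = sum(unique.values())
    return {
        "dedup_ratio": len(data) / stored if stored else 1.0,   # logical bytes / stored bytes (one common convention)
        "avg_chunk_size": statistics.mean(sizes) if sizes else 0,
        "chunk_size_variance": statistics.pvariance(sizes) if sizes else 0,
    }

The deduplication-ratio definition used here (logical size divided by stored unique size) is one common convention and may differ from the exact definition used in the paper.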