A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse
{"title":"A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication","authors":"Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse","doi":"arxiv-2409.06066","DOIUrl":null,"url":null,"abstract":"Data deduplication emerged as a powerful solution for reducing storage and\nbandwidth costs by eliminating redundancies at the level of chunks. This has\nspurred the development of numerous Content-Defined Chunking (CDC) algorithms\nover the past two decades. Despite advancements, the current state-of-the-art\nremains obscure, as a thorough and impartial analysis and comparison is\nlacking. We conduct a rigorous theoretical analysis and impartial experimental\ncomparison of several leading CDC algorithms. Using four realistic datasets, we\nevaluate these algorithms against four key metrics: throughput, deduplication\nratio, average chunk size, and chunk-size variance. Our analyses, in many\ninstances, extend the findings of their original publications by reporting new\nresults and putting existing ones into context. Moreover, we highlight\nlimitations that have previously gone unnoticed. Our findings provide valuable\ninsights that inform the selection and optimization of CDC algorithms for\npractical applications in data deduplication.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Data deduplication has emerged as a powerful solution for reducing storage and bandwidth costs by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite advancements, the current state of the art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of the original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
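To make the abstract's terminology concrete, the sketch below shows a minimal, illustrative Gear-style content-defined chunker together with three of the four metrics named above (deduplication ratio, average chunk size, and chunk-size variance; throughput would be measured separately by timing the chunking pass). The gear table, mask, and size bounds are assumptions chosen for demonstration only and do not correspond to the paper's configuration or to any specific algorithm it evaluates.

# Illustrative sketch only, not the paper's method: a Gear-style rolling-hash
# chunker plus three of the evaluation metrics named in the abstract.
import hashlib
import random
import statistics

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one random 64-bit value per byte value (assumed table)
MASK = (1 << 13) - 1          # cut when the low 13 bits are zero -> ~8 KiB expected chunk size (assumed)
MIN_SIZE, MAX_SIZE = 2048, 65536  # hard bounds on chunk size (assumed)

def gear_chunks(data: bytes):
    """Yield content-defined chunks using a Gear rolling hash."""
    start, fp = 0, 0
    for i, byte in enumerate(data):
        fp = ((fp << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (fp & MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start, fp = i + 1, 0
    if start < len(data):
        yield data[start:]        # trailing chunk

def dedup_metrics(data: bytes):
    """Deduplication ratio, average chunk size, and chunk-size variance over one dataset."""
    sizes, unique = [], {}
    for chunk in gear_chunks(data):
        sizes.append(len(chunk))
        # deduplicate by content hash: identical chunks are stored once
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    stored = sum(unique.values())
    return {
        "dedup_ratio": len(data) / stored if stored else 1.0,   # logical bytes / stored bytes (one common convention)
        "avg_chunk_size": statistics.mean(sizes) if sizes else 0,
        "chunk_size_variance": statistics.pvariance(sizes) if sizes else 0,
    }

The deduplication-ratio definition used here (logical size divided by stored unique size) is one common convention and may differ from the exact definition used in the paper.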