A High-performance Post-deduplication Delta Compression Scheme for Packed Datasets

2021 IEEE 39th International Conference on Computer Design (ICCD); Pub Date: 2021-10-01; DOI: 10.1109/ICCD53106.2021.00078

Yucheng Zhang, Hong Jiang, Mengtian Shi, Chunzhi Wang, Nan Jiang, Xinyun Wu


Citations: 2

Abstract

Data deduplication has become a standard feature in most backup storage systems because it reduces storage costs. In real-world deduplication-based backup products, small files are grouped into larger packed files prior to deduplication. For each file, the backup product inserts a metadata block immediately before the file contents. Because the contents of these metadata blocks vary with every backup, different backup streams of packed files built from the same or highly similar small files contain chunks that conventional deduplication treats as mostly unique. In other words, apart from the metadata blocks, most of the content in these unique chunks is identical across backups. Delta compression can remove that redundancy, but it cannot be applied to backup storage directly because the extra I/Os required to retrieve the base chunks significantly decrease backup throughput. If a backup dataset contains many grouped small files, some duplicate chunks, called persistent fragmented chunks (PFCs), may be rewritten repeatedly. We observe that PFCs are often surrounded by a substantial number of unique chunks containing metadata blocks. In this paper, we propose a PFC-inspired delta compression scheme that efficiently performs delta compression on the unique chunks surrounding identical PFCs. During deduplication, containers holding previous copies of the chunks being considered for storage are accessed to prefetch metadata and so accelerate duplicate detection. The main idea behind our scheme is to identify the containers holding PFCs and, when they are accessed during deduplication, prefetch the chunks in those containers by piggybacking on the reads issued for prefetching metadata. Base chunks for delta compression are then detected among the prefetched chunks, which eliminates the extra I/Os for retrieving base chunks. Experimental results show that PFC-inspired delta compression achieves about 2x additional data reduction on top of data deduplication and improves restore speed by 8.6%-49.3%, at a moderate cost of 0.5%-11.9% in backup throughput.
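The piggybacking idea can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' implementation: `ToyStore`, `backup`, and `demo` are hypothetical names, the set of containers holding PFCs is assumed to be known in advance (how PFCs are identified is not modeled), base detection uses Python's `difflib.SequenceMatcher` as a stand-in for a real resemblance-detection method, and a two-pass loop over the stream stands in for whatever buffering lets unique chunks on both sides of a PFC be handled.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint used for exact-duplicate detection."""
    return hashlib.sha1(chunk).hexdigest()


class ToyStore:
    """Hypothetical container store: each container maps fingerprint -> chunk."""

    def __init__(self):
        self.containers = []          # list of {fingerprint: chunk bytes}
        self.index = {}               # fingerprint -> container id
        self.pfc_containers = set()   # ids ASSUMED to hold PFCs (detection not modeled)
        self.container_reads = 0      # counts container accesses during backup

    def write_container(self, chunks, holds_pfc=False):
        cid = len(self.containers)
        self.containers.append({fingerprint(c): c for c in chunks})
        for fp in self.containers[cid]:
            self.index[fp] = cid
        if holds_pfc:
            self.pfc_containers.add(cid)
        return cid


def backup(store, chunks, threshold=0.8):
    """Return one (kind, base_fingerprint) pair per chunk.

    kind is 'dup', 'delta', or 'unique'; base_fingerprint is set only for 'delta'.
    """
    prefetched = {}   # fingerprint -> chunk bytes gathered by piggybacked reads
    loaded = set()    # containers already read during this backup
    plan = []
    # Pass 1: deduplication. A duplicate hit triggers one container read to
    # prefetch metadata; for PFC containers the chunk data rides along.
    for chunk in chunks:
        cid = store.index.get(fingerprint(chunk))
        if cid is None:
            plan.append(["unique", chunk, None])
            continue
        if cid not in loaded:
            store.container_reads += 1
            loaded.add(cid)
            if cid in store.pfc_containers:
                prefetched.update(store.containers[cid])
        plan.append(["dup", chunk, None])
    # Pass 2: look for delta bases only among prefetched chunks, so base
    # detection issues no additional container reads.
    for entry in plan:
        if entry[0] != "unique":
            continue
        best = threshold
        for fp, base in prefetched.items():
            score = SequenceMatcher(None, entry[1], base, autojunk=False).ratio()
            if score >= best:
                best, entry[0], entry[2] = score, "delta", fp
    return [(kind, base_fp) for kind, _, base_fp in plan]


def demo():
    """Two backups of the same small files; only the metadata blocks differ."""
    file_a, file_b = b"alpha-" * 40, b"OMEGA#" * 40
    pfc = b"shared-library-blob" * 10
    store = ToyStore()
    store.write_container(
        [b"mtime=0001;" + file_a, pfc, b"mtime=0001;" + file_b], holds_pfc=True
    )
    result = backup(store, [b"mtime=0002;" + file_a, pfc, b"mtime=0002;" + file_b])
    return result, store


if __name__ == "__main__":
    result, store = demo()
    print([kind for kind, _ in result], "container reads:", store.container_reads)
```

In this toy run, the changed metadata block makes both packed-file chunks miss in the fingerprint index, while the PFC between them hits. That single hit loads the container, and the two neighbouring chunks then find their bases among the prefetched chunks without any further read.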



