The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression

IF 2.6 | Region 3: Computer Science | JCR Q3: Computer Science, Hardware & Architecture

ACM Transactions on Storage Pub Date : 2023-02-16 DOI:10.1145/3584663

Wen Xia, Lifeng Pu, Xiangyu Zou, Philip Shilane, Shiyi Li, Haijun Zhang, Xuan Wang


Citations: 0

Abstract


Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte by byte across data chunks and ② applying multiple transforms on all of the calculated rolling hash values. In this article, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and a high compression ratio. Odess first utilizes a novel Subwindow-based Parallel Rolling (SWPR) hash method using Single Instruction Multiple Data [1] (SIMD) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Odess then uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set and quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead). Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼31.4× and ∼7.9× faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform [39]), respectively. When considering an end-to-end data reduction storage system, the Odess-based system’s throughput is about 3.20× and 1.41× higher than the N-Transform- and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼1.22× higher compression ratio over Finesse.
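The two-step idea in the abstract (sample a small proxy set of rolling hashes, then apply the transforms only to that set) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Gear-style rolling hash, the random `GEAR` table, the 12 linear transforms in `COEFFS`, the roughly 1-in-128 `SAMPLE_MASK`, and the function name `sampled_features` are all assumptions chosen for illustration, and the SIMD-based SWPR acceleration is omitted entirely.

```python
import random

MASK64 = (1 << 64) - 1
_rng = random.Random(2023)  # fixed seed so the sketch is reproducible

# Hypothetical per-byte random table for a Gear-style rolling hash.
GEAR = [_rng.getrandbits(64) for _ in range(256)]

# Hypothetical linear transforms t_i(h) = (m_i * h + a_i) mod 2^64.
NUM_FEATURES = 12
COEFFS = [(_rng.getrandbits(64) | 1, _rng.getrandbits(64))
          for _ in range(NUM_FEATURES)]

# A hash is sampled only if its top 7 bits are zero (about 1 in 128).
SAMPLE_MASK = 0x7F << 57


def sampled_features(chunk):
    """Return (features, sampled_count) for one chunk of bytes."""
    features = [0] * NUM_FEATURES
    sampled = 0
    h = 0
    for byte in chunk:
        # Rolling update: each byte's contribution shifts out after 64 steps.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if h & SAMPLE_MASK:
            continue  # not in the proxy set, so no transforms are applied
        sampled += 1
        for i, (m, a) in enumerate(COEFFS):
            t = (m * h + a) & MASK64
            if t > features[i]:
                features[i] = t  # keep the maximum transformed value
    return features, sampled
```

Under these assumptions, two chunks would be treated as resembling each other when most of their feature values match. Because the sampling decision depends only on the hash value, and therefore on the content, similar chunks tend to select the same positions. The transform loop runs for only a small fraction of positions, which is the source of the saving on the second part of the overhead.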


Source journal

ACM Transactions on Storage (Computer Science, Hardware & Architecture; Computer Science, Software Engineering)

CiteScore

4.20

Self-citation rate

5.90%

Articles published

Review time

>12 weeks

Journal description: The ACM Transactions on Storage (TOS) is a new journal with an intent to publish original archival papers in the area of storage and closely related disciplines. Articles that appear in TOS will tend either to present new techniques and concepts or to report novel experiences and experiments with practical systems. Storage is a broad and multidisciplinary area that comprises network protocols, resource management, data backup, replication, recovery, devices, security, and theory of data coding, densities, and low-power. Potential synergies among these fields are expected to open up new research directions.