Chunzhi Wang, Keguan Wang, Min Li, Feifei Wei, Neal Xiong
Chunk2vec: A novel resemblance detection scheme based on Sentence-BERT for post-deduplication delta compression in network transmission
IET Communications, vol. 18, no. 2, pp. 145-159. Published 2024-01-04. DOI: 10.1049/cmu2.12719
https://onlinelibrary.wiley.com/doi/10.1049/cmu2.12719
Citations: 0
Abstract
Delta compression, a complementary technique to data deduplication, has gained widespread attention in network storage systems. It eliminates redundant data between non-duplicate but similar chunks, which data deduplication cannot identify. Combining data deduplication with delta compression can greatly reduce the network transmission overhead between servers and clients. Resemblance detection identifies similar chunks for post-deduplication delta compression in network storage systems. Existing resemblance detection approaches rely on a preset similarity threshold and so fail to detect similar chunks with arbitrary similarity, which can be suboptimal. In this paper, the authors propose Chunk2vec, a resemblance detection scheme for delta compression that uses deep learning and an Approximate Nearest Neighbour Search technique to detect similar chunks within any given similarity range. Chunk2vec uses a deep neural network, Sentence-BERT, to extract an approximate feature vector for each chunk while preserving its similarity to other chunks. Experimental results on five real-world datasets indicate that Chunk2vec improves the accuracy of resemblance detection for delta compression and achieves a higher compression ratio than the state-of-the-art resemblance detection technique.
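The idea of mapping each chunk to a feature vector and then retrieving chunks whose similarity falls in a chosen range can be illustrated with a minimal sketch. It is not the authors' implementation. As loudly stated assumptions, a sparse byte 4-gram count vector stands in for the Sentence-BERT embedding, and an exhaustive scan stands in for Approximate Nearest Neighbour Search. The names `embed`, `cosine` and `find_similar` are hypothetical.

```python
import math
from collections import Counter


def embed(chunk: bytes, n: int = 4) -> Counter:
    """Stand-in feature vector: counts of overlapping byte n-grams.

    Assumption: this replaces the Sentence-BERT model named in the abstract
    so that the sketch runs with the standard library only.
    """
    return Counter(chunk[i:i + n] for i in range(len(chunk) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Clamp to guard against floating-point results slightly above 1.
    return min(1.0, dot / (norm_a * norm_b))


def find_similar(query: bytes, index: dict, low: float, high: float) -> list:
    """Return (chunk_id, similarity) pairs with low <= similarity <= high.

    Assumption: a brute-force scan replaces the ANN index; results are
    sorted with the most similar chunk first.
    """
    q = embed(query)
    scored = ((cid, cosine(q, vec)) for cid, vec in index.items())
    hits = [(cid, s) for cid, s in scored if low <= s <= high]
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data: a base chunk, a lightly edited copy, and an unrelated chunk.
base = b"The quick brown fox jumps over the lazy dog. " * 20
edited = base.replace(b"lazy", b"sleepy", 3)
unrelated = b"0123456789" * 90

index = {"base": embed(base), "unrelated": embed(unrelated)}

near = find_similar(edited, index, low=0.9, high=1.0)  # delta-compression candidates
far = find_similar(edited, index, low=0.0, high=0.5)   # chunks too dissimilar to help
```

Here `near` contains only the base chunk, which would be the natural reference for delta-encoding the edited chunk, while `far` contains only the unrelated chunk. Changing `low` and `high` selects a different similarity band without rebuilding the index, which is the property the abstract attributes to searching over feature vectors rather than applying a single fixed threshold.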
About the journal:
IET Communications covers fundamental and generic research aimed at a better understanding of communication technologies, harnessing signals to build better-performing communication systems over wired and/or wireless media. The journal is particularly interested in research papers reporting novel solutions to the dominant problems of noise, interference, timing and errors, with the goal of reducing system deficiencies such as the waste of scarce resources like spectrum, energy and bandwidth.
Topics include, but are not limited to:
Coding and Communication Theory;
Modulation and Signal Design;
Wired, Wireless and Optical Communication;
Communication System
Special Issues. Current Call for Papers:
Cognitive and AI-enabled Wireless and Mobile - https://digital-library.theiet.org/files/IET_COM_CFP_CAWM.pdf
UAV-Enabled Mobile Edge Computing - https://digital-library.theiet.org/files/IET_COM_CFP_UAV.pdf