{"title":"Data Deduplication with Random Substitutions","authors":"Hao Lou, Farzad Farnoud","doi":"10.1109/ISIT44484.2020.9174380","DOIUrl":null,"url":null,"abstract":"Data deduplication saves storage space by identifying and removing repeats in the data stream. In this paper, we provide an information-theoretic analysis of the performance of deduplication algorithms with data streams where repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. Two modified versions of fixed-length deduplication are studied and proven to have performance within a constant factor of optimal with the knowledge of repeat length. We also study the variable-length scheme and show that as entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string.","PeriodicalId":159311,"journal":{"name":"2020 IEEE International Symposium on Information Theory (ISIT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Symposium on Information Theory (ISIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISIT44484.2020.9174380","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Data deduplication saves storage space by identifying and removing repeats in the data stream. In this paper, we provide an information-theoretic analysis of the performance of deduplication algorithms with data streams where repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. Two modified versions of fixed-length deduplication are studied and proven to have performance within a constant factor of optimal with the knowledge of repeat length. We also study the variable-length scheme and show that as entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string.