对文件集合进行基于集群的增量压缩

Z. Ouyang, N. Memon, Torsten Suel, Dimitre Trendafilov
{"title":"对文件集合进行基于集群的增量压缩","authors":"Z. Ouyang, N. Memon, Torsten Suel, Dimitre Trendafilov","doi":"10.1109/WISE.2002.1181662","DOIUrl":null,"url":null,"abstract":"Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.","PeriodicalId":392999,"journal":{"name":"Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002.","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"58","resultStr":"{\"title\":\"Cluster-based delta compression of a collection of files\",\"authors\":\"Z. Ouyang, N. Memon, Torsten Suel, Dimitre Trendafilov\",\"doi\":\"10.1109/WISE.2002.1181662\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.\",\"PeriodicalId\":392999,\"journal\":{\"name\":\"Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002.\",\"volume\":\"33 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2002-12-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"58\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/WISE.2002.1181662\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WISE.2002.1181662","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 58

摘要

增量压缩技术通常用于简洁地表示文件相对于较早版本的更新版本。我们在一个稍有不同的场景中研究增量压缩的使用,在这个场景中,我们希望通过执行一系列成对增量压缩来压缩大量(或多或少)相关的文件。通过采取两两的增量来为文件集合寻找最佳增量编码的问题可以简化为计算加权有向图中最大权重的分支的问题,但这种解决方案效率低下,因此不能扩展到更大的文件集合。这促使我们提出了一个基于聚类的增量压缩框架,该框架使用文本聚类技术来修剪可能的成对增量编码的图。为了证明我们的方法的有效性,我们给出了Web页面集合的实验结果。我们的实验表明,与单独压缩每个文件或使用tar+gzip相比,基于集群的集合增量压缩在压缩比方面提供了显著的改进,而且效率的成本适中。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Cluster-based delta compression of a collection of files
Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信