{"title":"Natural Language Compression Optimized for Large Set of Files","authors":"P. Procházka, J. Holub","doi":"10.1109/DCC.2013.93","DOIUrl":null,"url":null,"abstract":"Summary form only given. The web search engines store the web pages in the raw text form to build so called snippets (short text surrounding the searched pattern) or to perform so called positional ranking functions. We address the problem of the compression of a large collection of text files distributed in cluster of computers, where the single files need to be randomly accessed in very short time. The compression algorithm Set-of-Files Semi-Adaptive Two Byte Dense Code (SF-STBDC) is based on the word-based approach and the idea of combination of two statistical models: the global model (common for all the files of the set) and the local model. The latter is built as the set of changes which transform the global model to the proper model of the single compressed file. Except very good compression ratio the compression method allows fast searching on the compressed text, which is an attractive property especially for search engines property especially for search engines. Exactly the same problem (compression of a set of files using byte codes) was first stated in. Our algorithm SF-STBDC overcomes the algorithm based on (s,c) - Dense Code in compression ratio and at the same time it keeps a very good searching and decompression speed. The key idea to achieve this result is a usage of Semi-Adaptive Two Byte Dense Code which provides more effective coding of small portions ofof the text and still allows exact setting of the number of stoppers and continuers.","PeriodicalId":388717,"journal":{"name":"2013 Data Compression Conference","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.2013.93","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Summary form only given. Web search engines store web pages in raw text form to build so-called snippets (short text surrounding the searched pattern) or to evaluate positional ranking functions. We address the problem of compressing a large collection of text files distributed over a cluster of computers, where individual files need to be randomly accessed in very short time. The compression algorithm Set-of-Files Semi-Adaptive Two Byte Dense Code (SF-STBDC) is word-based and combines two statistical models: a global model (common to all files of the set) and a local model. The latter is built as the set of changes that transform the global model into the proper model of the individual compressed file. Besides a very good compression ratio, the method allows fast searching directly on the compressed text, an attractive property especially for search engines. Exactly the same problem (compression of a set of files using byte codes) was first stated in prior work. Our algorithm SF-STBDC outperforms the algorithm based on (s,c)-Dense Code in compression ratio, while keeping very good searching and decompression speed. The key to achieving this result is the use of Semi-Adaptive Two Byte Dense Code, which provides more effective coding of small portions of the text and still allows exact setting of the number of stoppers and continuers.
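To illustrate the dense-code family the paper builds on (and compares against), the sketch below encodes and decodes word ranks with a plain (s,c)-Dense Code: byte values below s act as stoppers (codeword terminators) and the remaining c = 256 - s values act as continuers. This is not the authors' SF-STBDC implementation; the function names and the default s = 192 are illustrative assumptions, and unlike the paper's Two Byte Dense Code variant this general sketch does not cap codewords at two bytes.

```python
def sc_dense_encode(rank: int, s: int = 192) -> bytes:
    """Encode a 0-based word rank with (s,c)-Dense Code.

    Byte values 0..s-1 are stoppers (they end a codeword),
    values s..255 are continuers; c = 256 - s.
    Lower ranks (more frequent words) get shorter codewords.
    """
    c = 256 - s
    out = [rank % s]          # the last byte of every codeword is a stopper
    rank //= s
    while rank > 0:           # remaining value goes into continuer bytes
        rank -= 1
        out.append(s + rank % c)
        rank //= c
    out.reverse()             # most significant continuer first
    return bytes(out)


def sc_dense_decode(code: bytes, s: int = 192) -> int:
    """Invert sc_dense_encode: recover the word rank from one codeword."""
    c = 256 - s
    m = 0
    for b in code[:-1]:       # continuer bytes, most significant first
        m = m * c + (b - s) + 1
    return m * s + code[-1]   # final byte is the stopper


if __name__ == "__main__":
    # Ranks 0..s-1 fit in one byte; the next s*c ranks fit in two bytes.
    for r in (0, 191, 192, 12479):
        cw = sc_dense_encode(r)
        assert sc_dense_decode(cw) == r
        print(r, "->", list(cw))
```

In a word-based scheme of this kind, the vocabulary is sorted by frequency so that the most frequent words receive the one-byte codewords; because s and c can be set exactly (subject to s + c = 256), the split between stoppers and continuers can be tuned to the word-frequency distribution, which is the tuning the abstract refers to.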