Natural Language Compression Optimized for Large Set of Files

P. Procházka, J. Holub
{"title":"Natural Language Compression Optimized for Large Set of Files","authors":"P. Procházka, J. Holub","doi":"10.1109/DCC.2013.93","DOIUrl":null,"url":null,"abstract":"Summary form only given. The web search engines store the web pages in the raw text form to build so called snippets (short text surrounding the searched pattern) or to perform so called positional ranking functions. We address the problem of the compression of a large collection of text files distributed in cluster of computers, where the single files need to be randomly accessed in very short time. The compression algorithm Set-of-Files Semi-Adaptive Two Byte Dense Code (SF-STBDC) is based on the word-based approach and the idea of combination of two statistical models: the global model (common for all the files of the set) and the local model. The latter is built as the set of changes which transform the global model to the proper model of the single compressed file. Except very good compression ratio the compression method allows fast searching on the compressed text, which is an attractive property especially for search engines property especially for search engines. Exactly the same problem (compression of a set of files using byte codes) was first stated in. Our algorithm SF-STBDC overcomes the algorithm based on (s,c) - Dense Code in compression ratio and at the same time it keeps a very good searching and decompression speed. The key idea to achieve this result is a usage of Semi-Adaptive Two Byte Dense Code which provides more effective coding of small portions ofof the text and still allows exact setting of the number of stoppers and continuers.","PeriodicalId":388717,"journal":{"name":"2013 Data Compression Conference","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.2013.93","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Summary form only given. Web search engines store web pages in raw text form in order to build so-called snippets (short text excerpts surrounding the searched pattern) or to compute so-called positional ranking functions. We address the problem of compressing a large collection of text files distributed over a cluster of computers, where individual files need to be randomly accessed in very short time. The compression algorithm Set-of-Files Semi-Adaptive Two Byte Dense Code (SF-STBDC) is based on the word-based approach and on the idea of combining two statistical models: a global model, common to all files of the set, and a local model. The latter is built as the set of changes that transform the global model into the proper model of a single compressed file. Besides a very good compression ratio, the method allows fast searching directly on the compressed text, which is an attractive property, especially for search engines. Exactly the same problem (compression of a set of files using byte codes) was first stated in earlier work. Our algorithm SF-STBDC outperforms the algorithm based on (s,c)-Dense Code in compression ratio while keeping very good searching and decompression speed. The key to this result is the use of Semi-Adaptive Two Byte Dense Code, which encodes small portions of the text more effectively and still allows exact setting of the number of stoppers and continuers.
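The paper is available in summary form only, so no reference implementation appears here. As a rough illustration of the stopper/continuer idea behind the dense-code family the abstract refers to, the Python sketch below builds a semi-adaptive word codebook restricted to codewords of at most two bytes: byte values below a threshold s act as stoppers (they end a codeword), values s..255 as continuers. The parameter s, the function names, and the toy input are our own assumptions for illustration, not the authors' code; the global/local model split of SF-STBDC is omitted.

```python
from collections import Counter

def build_codebook(words, s=128):
    # Semi-adaptive step: rank distinct words by frequency (first pass
    # over the text) and assign dense codewords. The s most frequent
    # words get a single stopper byte (0..s-1); the next (256 - s) * s
    # words get a continuer byte (s..255) followed by a stopper byte.
    c = 256 - s
    ranked = [w for w, _ in Counter(words).most_common()]
    if len(ranked) > s + c * s:
        raise ValueError("vocabulary exceeds two-byte code capacity")
    book = {}
    for rank, word in enumerate(ranked):
        if rank < s:
            book[word] = bytes([rank])               # one-byte codeword
        else:
            i = rank - s
            book[word] = bytes([s + i // s, i % s])  # continuer + stopper
    return book

def encode(words, book):
    return b"".join(book[w] for w in words)

def decode(data, book, s=128):
    inverse = {cw: w for w, cw in book.items()}
    out, i = [], 0
    while i < len(data):
        if data[i] < s:                      # stopper: one-byte codeword
            out.append(inverse[data[i:i + 1]]); i += 1
        else:                                # continuer: two-byte codeword
            out.append(inverse[data[i:i + 2]]); i += 2
    return out

words = "to be or not to be".split()
book = build_codebook(words, s=2)  # tiny s just to force two-byte codewords
packed = encode(words, book)
assert decode(packed, book, s=2) == words
```

Two properties of this scheme mirror what the abstract claims: the stopper/continuer split can be set exactly (any s from 1 to 255), and a pattern can be searched by encoding it and byte-matching the compressed text, since a codeword always ends in a stopper and so a candidate match is verified simply by checking that the preceding byte is a stopper or the file start.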