{"title":"Natural Language Compression Optimized for Large Set of Files","authors":"P. Procházka, J. Holub","doi":"10.1109/DCC.2013.93","DOIUrl":null,"url":null,"abstract":"Summary form only given. The web search engines store the web pages in the raw text form to build so called snippets (short text surrounding the searched pattern) or to perform so called positional ranking functions. We address the problem of the compression of a large collection of text files distributed in cluster of computers, where the single files need to be randomly accessed in very short time. The compression algorithm Set-of-Files Semi-Adaptive Two Byte Dense Code (SF-STBDC) is based on the word-based approach and the idea of combination of two statistical models: the global model (common for all the files of the set) and the local model. The latter is built as the set of changes which transform the global model to the proper model of the single compressed file. Except very good compression ratio the compression method allows fast searching on the compressed text, which is an attractive property especially for search engines property especially for search engines. Exactly the same problem (compression of a set of files using byte codes) was first stated in. Our algorithm SF-STBDC overcomes the algorithm based on (s,c) - Dense Code in compression ratio and at the same time it keeps a very good searching and decompression speed. The key idea to achieve this result is a usage of Semi-Adaptive Two Byte Dense Code which provides more effective coding of small portions ofof the text and still allows exact setting of the number of stoppers and continuers.","PeriodicalId":388717,"journal":{"name":"2013 Data Compression Conference","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.2013.93","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Summary form only given. Web search engines store web pages in raw text form to build so-called snippets (short text surrounding the searched pattern) or to evaluate positional ranking functions. We address the problem of compressing a large collection of text files distributed over a cluster of computers, where individual files need to be randomly accessed in very short time. The compression algorithm Set-of-Files Semi-Adaptive Two Byte Dense Code (SF-STBDC) is word-based and combines two statistical models: a global model (common to all files of the set) and a local model. The latter is built as the set of changes that transform the global model into the proper model of the individual compressed file. Besides a very good compression ratio, the method allows fast searching directly on the compressed text, an attractive property especially for search engines. Exactly the same problem (compression of a set of files using byte codes) was first stated in prior work. Our algorithm SF-STBDC outperforms the algorithm based on (s,c)-Dense Code in compression ratio, while keeping very good searching and decompression speed. The key to achieving this result is the use of Semi-Adaptive Two Byte Dense Code, which provides more effective coding of small portions of the text and still allows exact setting of the number of stoppers and continuers.
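To illustrate the dense-code family the paper builds on (and compares against), the sketch below encodes and decodes word ranks with a plain (s,c)-Dense Code: byte values below s act as stoppers (codeword terminators) and the remaining c = 256 - s values act as continuers. This is not the authors' SF-STBDC implementation; the function names and the default s = 192 are illustrative assumptions, and unlike the paper's Two Byte Dense Code variant this general sketch does not cap codewords at two bytes.

```python
def sc_dense_encode(rank: int, s: int = 192) -> bytes:
    """Encode a 0-based word rank with (s,c)-Dense Code.

    Byte values 0..s-1 are stoppers (they end a codeword),
    values s..255 are continuers; c = 256 - s.
    Lower ranks (more frequent words) get shorter codewords.
    """
    c = 256 - s
    out = [rank % s]          # the last byte of every codeword is a stopper
    rank //= s
    while rank > 0:           # remaining value goes into continuer bytes
        rank -= 1
        out.append(s + rank % c)
        rank //= c
    out.reverse()             # most significant continuer first
    return bytes(out)


def sc_dense_decode(code: bytes, s: int = 192) -> int:
    """Invert sc_dense_encode: recover the word rank from one codeword."""
    c = 256 - s
    m = 0
    for b in code[:-1]:       # continuer bytes, most significant first
        m = m * c + (b - s) + 1
    return m * s + code[-1]   # final byte is the stopper


if __name__ == "__main__":
    # Ranks 0..s-1 fit in one byte; the next s*c ranks fit in two bytes.
    for r in (0, 191, 192, 12479):
        cw = sc_dense_encode(r)
        assert sc_dense_decode(cw) == r
        print(r, "->", list(cw))
```

In a word-based scheme of this kind, the vocabulary is sorted by frequency so that the most frequent words receive the one-byte codewords; because s and c can be set exactly (subject to s + c = 256), the split between stoppers and continuers can be tuned to the word-frequency distribution, which is the tuning the abstract refers to.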