Compact in-memory models for compression of large text databases

6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268). Pub Date: 1999-09-21. DOI: 10.1109/SPIRE.1999.796599

J. Zobel, H. Williams


Citations: 6

Abstract

For compression of text databases, semi-static word-based models are a pragmatic choice. Previous experiments have shown that, where there is not sufficient memory to store a full word-based model, encoding rare words as sequences of characters can still allow good compression, while a pure character-based model is poor. We propose a further kind of model that reduces main memory costs: approximate models, in which rare words are represented by similarly spelt common words and a sequence of edits. We investigate the compression available with different models, including characters, words, word pairs, and edits, and with combinations of these approaches. We show experimentally that carefully chosen combinations of models can improve the compression available in limited memory and greatly reduce overall memory requirements.
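The abstract does not give the authors' algorithms, so the following is only a minimal sketch of the general idea, under stated assumptions. It keeps a bounded list of frequent words in memory (the semi-static first pass), and codes each token as a word index, as a common word plus edits, or as raw characters. All function names, the `max_cost` threshold, and the edit-cost heuristic are hypothetical, and Python's `difflib.SequenceMatcher` stands in for whatever edit computation the paper actually uses. A real coder would then assign entropy codes to these symbols, which is omitted here.

```python
from collections import Counter
from difflib import SequenceMatcher


def build_model(tokens, max_words):
    """First pass: keep only the max_words most frequent words in memory."""
    return [w for w, _ in Counter(tokens).most_common(max_words)]


def edit_script(src, dst):
    """Edits turning src into dst, as (start, end, replacement) triples."""
    matcher = SequenceMatcher(None, src, dst, autojunk=False)
    return [(i1, i2, dst[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_edits(src, ops):
    """Apply edits right to left so earlier positions remain valid."""
    out = src
    for start, end, text in reversed(ops):
        out = out[:start] + text + out[end:]
    return out


def edit_cost(ops):
    """Assumed heuristic: one unit per edit plus one per inserted character."""
    return sum(1 + len(text) for _, _, text in ops)


def encode_token(word, vocab, max_cost=2):
    """Code a token as a word index, a common word plus edits, or characters."""
    if word in vocab:
        # A real implementation would use a hash table instead of list.index.
        return ("word", vocab.index(word))
    if vocab:
        # Pick the in-memory word whose edit script to `word` is cheapest.
        best = min(range(len(vocab)),
                   key=lambda i: edit_cost(edit_script(vocab[i], word)))
        ops = edit_script(vocab[best], word)
        if edit_cost(ops) <= max_cost:
            return ("edit", best, ops)
    # Fall back to spelling the rare word out character by character.
    return ("chars", word)


def decode_token(code, vocab):
    """Invert encode_token using the same in-memory word list."""
    if code[0] == "word":
        return vocab[code[1]]
    if code[0] == "edit":
        return apply_edits(vocab[code[1]], code[2])
    return code[1]


if __name__ == "__main__":
    text = "the cat sat on the mat the cat ran".split()
    model = build_model(text, max_words=3)
    for token in text:
        print(token, encode_token(token, model))
```

In this sketch the memory saving comes from `max_words`: only that many words are stored, and every other token must be reachable through an edit script or the character fallback. The choice between the edit and character paths would, in practice, be made by comparing coded lengths rather than the fixed threshold assumed here.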

