Sample selection for dictionary-based corpus compression

C. Hoobin, S. Puglisi, J. Zobel
{"title":"基于字典的语料库压缩的样本选择","authors":"C. Hoobin, S. Puglisi, J. Zobel","doi":"10.1145/2009916.2010087","DOIUrl":null,"url":null,"abstract":"Compression of large text corpora has the potential to drastically reduce both storage requirements and per-document access costs. Adaptive methods used for general-purpose compression are ineffective for this application, and historically the most successful methods have been based on word-based dictionaries, which allow use of global properties of the text. However, these are dependent on the text complying with assumptions about content and lead to dictionaries of unpredictable size. In recent work we have described an LZ-like approach in which sampled blocks of a corpus are used as a dictionary against which the complete corpus is compressed, giving compression twice as effective than that of zlib. Here we explore how pre-processing can be used to eliminate redundancy in our sampled dictionary. Our experiments show that dictionary size can be reduced by 50% or more (less than 0.1% of the collection size) with no significant effect on compression or access speed.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Sample selection for dictionary-based corpus compression\",\"authors\":\"C. Hoobin, S. Puglisi, J. Zobel\",\"doi\":\"10.1145/2009916.2010087\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Compression of large text corpora has the potential to drastically reduce both storage requirements and per-document access costs. Adaptive methods used for general-purpose compression are ineffective for this application, and historically the most successful methods have been based on word-based dictionaries, which allow use of global properties of the text. However, these are dependent on the text complying with assumptions about content and lead to dictionaries of unpredictable size. In recent work we have described an LZ-like approach in which sampled blocks of a corpus are used as a dictionary against which the complete corpus is compressed, giving compression twice as effective than that of zlib. Here we explore how pre-processing can be used to eliminate redundancy in our sampled dictionary. 
Our experiments show that dictionary size can be reduced by 50% or more (less than 0.1% of the collection size) with no significant effect on compression or access speed.\",\"PeriodicalId\":356580,\"journal\":{\"name\":\"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-07-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2009916.2010087\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2009916.2010087","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8

Abstract

Compression of large text corpora has the potential to drastically reduce both storage requirements and per-document access costs. Adaptive methods used for general-purpose compression are ineffective for this application, and historically the most successful methods have been based on word-based dictionaries, which allow use of global properties of the text. However, these are dependent on the text complying with assumptions about content and lead to dictionaries of unpredictable size. In recent work we have described an LZ-like approach in which sampled blocks of a corpus are used as a dictionary against which the complete corpus is compressed, giving compression twice as effective as that of zlib. Here we explore how pre-processing can be used to eliminate redundancy in our sampled dictionary. Our experiments show that dictionary size can be reduced by 50% or more (less than 0.1% of the collection size) with no significant effect on compression or access speed.
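The sampling idea from the abstract can be illustrated with zlib's preset-dictionary API. The sketch below builds a dictionary from fixed-interval blocks of a corpus and compresses each document against it. This is a minimal sketch only, not the authors' method: zlib consults at most the last 32 KiB of a preset dictionary, whereas the paper compresses against a much larger sampled dictionary with LZ-style factorization. The block size, sampling step, and demo data are illustrative assumptions, not values from the paper.

```python
import zlib

def sample_dictionary(corpus: bytes, block_size: int = 1024,
                      step: int = 64 * 1024) -> bytes:
    """Concatenate fixed-size blocks sampled at regular intervals.
    block_size and step are assumed demo values, not the paper's."""
    return b"".join(corpus[i:i + block_size]
                    for i in range(0, len(corpus), step))

def compress_doc(doc: bytes, dictionary: bytes) -> bytes:
    """Compress one document against the shared sampled dictionary."""
    c = zlib.compressobj(zdict=dictionary)  # preset dictionary (Python 3.3+)
    return c.compress(doc) + c.flush()

def decompress_doc(blob: bytes, dictionary: bytes) -> bytes:
    """Recover a document; the same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()

if __name__ == "__main__":
    docs = [b"the quick brown fox " * 50, b"jumps over the lazy dog " * 50]
    corpus = b"".join(docs)
    dictionary = sample_dictionary(corpus, block_size=256, step=512)
    packed = [compress_doc(d, dictionary) for d in docs]
    assert all(decompress_doc(p, dictionary) == d
               for p, d in zip(packed, docs))
    print(sum(len(p) for p in packed), "compressed bytes for",
          sum(len(d) for d in docs), "original bytes")
```

Compressing each document separately against one global dictionary is what permits per-document random access without decompressing the whole collection; the paper's contribution here is a pre-processing step that eliminates redundancy within the sampled dictionary itself before it is used.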