A New Searchable Variable-to-Variable Compressor

N. Brisaboa, A. Fariña, Juan R. Lopez, G. Navarro, Eduardo Rodríguez López
Published in: 2010 Data Compression Conference, 2010-03-24
DOI: 10.1109/DCC.2010.25 (https://doi.org/10.1109/DCC.2010.25)
Citations: 14

Abstract

Word-based compression of natural language text has proven to be a good trade-off between compression ratio and speed, achieving compression ratios close to 30% and very fast decompression. It also permits fast searches over the compressed text using Boyer-Moore-type algorithms. Such compressors process fixed source symbols (words) and assign them variable-byte-length codewords, and thus follow a fixed-to-variable approach. We present a new variable-to-variable compressor (v2vdc) that uses both words and phrases as source symbols, which are encoded with a variable-length scheme. The phrases are chosen using the longest common prefix information on the suffix array of the text, so as to favor long and frequent phrases. We obtain compression ratios close to those of p7zip and ppmdi, outperforming bzip2, and 8-10 percentage points lower than that of the equivalent word-based compressor. In addition, v2vdc is among the fastest to decompress and allows efficient direct search of the compressed text, in some cases the fastest to date.
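
The phrase-selection idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the selection heuristic, so the code below only shows how longest-common-prefix (LCP) values between adjacent entries of a word-level suffix array expose repeated phrases, and then counts how often each one occurs. The function names, the naive suffix-array construction, and the `min_len` parameter are all assumptions made for illustration.

```python
def word_suffix_array(words):
    # Naive construction (assumption: fine for a toy input, far too slow for real text):
    # sort suffix start positions by the word sequence that begins there.
    return sorted(range(len(words)), key=lambda i: words[i:])


def lcp_array(words, sa):
    # lcp[k] = number of leading words shared by the suffixes at sa[k-1] and sa[k].
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(words) and b + n < len(words) and words[a + n] == words[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def count_occurrences(words, phrase):
    # Direct scan; counts overlapping occurrences of the phrase in the word sequence.
    m = len(phrase)
    return sum(1 for i in range(len(words) - m + 1) if tuple(words[i:i + m]) == phrase)


def candidate_phrases(words, min_len=2):
    # Every LCP value >= min_len marks a word sequence that occurs at least twice.
    # Returns {phrase (tuple of words): number of occurrences}.
    sa = word_suffix_array(words)
    lcp = lcp_array(words, sa)
    candidates = {
        tuple(words[sa[k]:sa[k] + lcp[k]])
        for k in range(1, len(sa))
        if lcp[k] >= min_len
    }
    return {p: count_occurrences(words, p) for p in candidates}


if __name__ == "__main__":
    text = "to be or not to be that is the question to be or not"
    phrases = candidate_phrases(text.split())
    # Rank by (length * frequency) as a hypothetical stand-in for "long and frequent".
    for phrase, freq in sorted(phrases.items(), key=lambda kv: -len(kv[0]) * kv[1]):
        print(" ".join(phrase), freq)
```

A real compressor would then have to decide which candidates are worth adding to the vocabulary, since replacing a phrase changes the frequencies of the words and phrases it overlaps; that decision is outside the scope of this sketch.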