Higher compression from the Burrows-Wheeler transform by modified sorting

B. Chapin, S. Tate
Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225), published 1998-03-30. DOI: 10.1109/DCC.1998.672253 (https://doi.org/10.1109/DCC.1998.672253)
Citations: 42

Abstract

Summary form only given. The Burrows-Wheeler transform (BWT) compression technique is based on sorting substrings of the input, and has a performance rivalling the best previously known techniques. We show that the ordering used in the sorting stage of the BWT, an aspect hitherto ignored, can have a significant impact on the size of the compressed data. We modify the sorting order in two separate ways. First, we try reordering the symbol alphabet and doing a standard sort based on the permuted character set. This is particularly interesting because the BWT's sensitivity to alphabet ordering is unusual among general-purpose compression schemes. Previous techniques, including statistical techniques (such as the PPM algorithms) and dictionary techniques (represented by LZ77, LZ78, and their descendants), are largely based on pattern matching, which is entirely independent of the encoding used for the source alphabet. On files in which the alphabet is arbitrarily ordered, such as ASCII text and certain domain-specific encodings (for example, the geo file from the Calgary Compression Corpus), this technique improved the compression ratio of the BWT-based compression algorithm. On the other hand, data which already had a significant alphabet ordering, such as image data, showed little improvement with this technique. The second modified sorting technique was to modify the sorting algorithm itself to order strings in a manner analogous to reflected Gray codes. In particular, we alternated increasing and decreasing order on the second character position, changing whenever the character in the first position changed.
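
The two modified orderings described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `bwt`, the parameters `alphabet` and `reflect`, and the naive sort over cyclic rotations are all assumptions made for illustration. The sketch assumes a non-empty input in which every symbol of `alphabet` occurs, and it reflects the order on the second character position only, as the abstract describes.

```python
def bwt(data, alphabet=None, reflect=False):
    """Naive BWT over cyclic rotations (illustrative sketch only).

    alphabet: hypothetical parameter giving the symbol order used by the sort;
              defaults to the usual code-point order of the symbols present.
    reflect:  hypothetical flag; if True, the order on the second character
              alternates between increasing and decreasing each time the
              first character changes.
    Returns (last column, row index of the original string).
    """
    if alphabet is None:
        alphabet = sorted(set(data))
    rank = {ch: i for i, ch in enumerate(alphabet)}
    n = len(data)  # assumed to be at least 1

    def key(start):
        # Ranks of the rotation that begins at position `start`.
        ranks = [rank[c] for c in data[start:] + data[:start]]
        # Assumption: every alphabet symbol occurs in `data`, so the parity
        # of the first rank flips exactly when the first character changes.
        if reflect and n > 1 and ranks[0] % 2 == 1:
            ranks[1] = -ranks[1]  # descending order on the second position
        return ranks

    order = sorted(range(n), key=key)
    last = "".join(data[(i - 1) % n] for i in order)
    return last, order.index(0)


print(bwt("banana"))                  # ('nnbaaa', 3)  standard ordering
print(bwt("banana", alphabet="nba"))  # ('aaabnn', 2)  permuted alphabet
print(bwt("abcb"))                    # ('bcab', 0)    standard ordering
print(bwt("abcb", reflect=True))      # ('bacb', 0)    reflected second position
```

These toy inputs only show that the transform output depends on the ordering; they say nothing about compressed size, which the abstract reports for real files. A practical implementation would use a suffix-sorting algorithm rather than materializing every rotation.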