A general-purpose compression scheme for databases

A. Cannane, H. Williams, J. Zobel
{"title":"A general-purpose compression scheme for databases","authors":"A. Cannane, H. Williams, J. Zobel","doi":"10.1109/DCC.1999.785676","DOIUrl":null,"url":null,"abstract":"Summary form only given. Current adaptive compression schemes such as GZIP and COMPRESS are impractical for database compression as they do not allow random access to individual records. A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaining compact storage of data. The SEQUITUR algorithm of Nevill-Manning et al., (1994, 1996, 1997) also adaptively compresses data, achieving excellent compression but with significant main-memory requirements. A preliminary version of SEQUITUR used a semi-static modelling approach to achieve slightly worse compression than the adaptive approach. We describe a new variant of the semi-static SEQUITUR algorithm, RAY, that reduces main-memory use and allows random-access to databases. RAY models repetition in sequences by progressively constructing a hierarchical grammar with multiple passes through the data. The multiple pass approach of RAY uses statistics on character pair repetition, or digram frequency, to create rules in the grammar. While our preliminary implementation is not especially fast, the multi-pass approach permits reductions in compression time, at the cost of affecting compression performance, by limiting the number of passes. We have found that RAY has practicable main-memory requirements and achieves better compression than an efficient Huffmann scheme and popular adaptive compression techniques. Moreover, our scheme allows random access to data and is not restricted to databases of text.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1999.785676","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Summary form only given. Current adaptive compression schemes such as GZIP and COMPRESS are impractical for database compression as they do not allow random access to individual records. A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaining compact storage of data. The SEQUITUR algorithm of Nevill-Manning et al., (1994, 1996, 1997) also adaptively compresses data, achieving excellent compression but with significant main-memory requirements. A preliminary version of SEQUITUR used a semi-static modelling approach to achieve slightly worse compression than the adaptive approach. We describe a new variant of the semi-static SEQUITUR algorithm, RAY, that reduces main-memory use and allows random-access to databases. RAY models repetition in sequences by progressively constructing a hierarchical grammar with multiple passes through the data. The multiple pass approach of RAY uses statistics on character pair repetition, or digram frequency, to create rules in the grammar. While our preliminary implementation is not especially fast, the multi-pass approach permits reductions in compression time, at the cost of affecting compression performance, by limiting the number of passes. We have found that RAY has practicable main-memory requirements and achieves better compression than an efficient Huffmann scheme and popular adaptive compression techniques. Moreover, our scheme allows random access to data and is not restricted to databases of text.
用于数据库的通用压缩方案
只提供摘要形式。当前的自适应压缩方案(如GZIP和COMPRESS)对于数据库压缩是不切实际的,因为它们不允许随机访问单个记录。通用数据库系统的压缩算法必须解决随机访问和单独解压缩记录的问题,同时保持数据的紧凑存储。neville - manning等人(1994,1996,1997)的SEQUITUR算法也自适应压缩数据,虽然压缩效果很好,但对主存的要求很高。SEQUITUR的初步版本使用半静态建模方法来实现比自适应方法稍差的压缩。我们描述了半静态SEQUITUR算法的一个新变体,RAY,它减少了主内存的使用,并允许随机访问数据库。RAY通过逐步构建具有多次遍历数据的分层语法来对序列中的重复进行建模。RAY的多次传递方法使用有关字符对重复或图频率的统计信息来创建语法中的规则。虽然我们的初步实现不是特别快,但通过限制传递次数,以影响压缩性能为代价,多通道方法可以减少压缩时间。我们发现RAY具有实际的主存需求,并且比高效的霍夫曼方案和流行的自适应压缩技术实现更好的压缩。此外,我们的方案允许随机访问数据,而不局限于文本数据库。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信