A general-purpose compression scheme for databases

Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096) Pub Date : 1999-03-29 DOI:10.1109/DCC.1999.785676

A. Cannane, H. Williams, J. Zobel

{"title":"A general-purpose compression scheme for databases","authors":"A. Cannane, H. Williams, J. Zobel","doi":"10.1109/DCC.1999.785676","DOIUrl":null,"url":null,"abstract":"Summary form only given. Current adaptive compression schemes such as GZIP and COMPRESS are impractical for database compression as they do not allow random access to individual records. A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaining compact storage of data. The SEQUITUR algorithm of Nevill-Manning et al., (1994, 1996, 1997) also adaptively compresses data, achieving excellent compression but with significant main-memory requirements. A preliminary version of SEQUITUR used a semi-static modelling approach to achieve slightly worse compression than the adaptive approach. We describe a new variant of the semi-static SEQUITUR algorithm, RAY, that reduces main-memory use and allows random-access to databases. RAY models repetition in sequences by progressively constructing a hierarchical grammar with multiple passes through the data. The multiple pass approach of RAY uses statistics on character pair repetition, or digram frequency, to create rules in the grammar. While our preliminary implementation is not especially fast, the multi-pass approach permits reductions in compression time, at the cost of affecting compression performance, by limiting the number of passes. We have found that RAY has practicable main-memory requirements and achieves better compression than an efficient Huffmann scheme and popular adaptive compression techniques. Moreover, our scheme allows random access to data and is not restricted to databases of text.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1999.785676","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 6

Abstract

Summary form only given. Current adaptive compression schemes such as GZIP and COMPRESS are impractical for database compression as they do not allow random access to individual records. A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaining compact storage of data. The SEQUITUR algorithm of Nevill-Manning et al., (1994, 1996, 1997) also adaptively compresses data, achieving excellent compression but with significant main-memory requirements. A preliminary version of SEQUITUR used a semi-static modelling approach to achieve slightly worse compression than the adaptive approach. We describe a new variant of the semi-static SEQUITUR algorithm, RAY, that reduces main-memory use and allows random-access to databases. RAY models repetition in sequences by progressively constructing a hierarchical grammar with multiple passes through the data. The multiple pass approach of RAY uses statistics on character pair repetition, or digram frequency, to create rules in the grammar. While our preliminary implementation is not especially fast, the multi-pass approach permits reductions in compression time, at the cost of affecting compression performance, by limiting the number of passes. We have found that RAY has practicable main-memory requirements and achieves better compression than an efficient Huffmann scheme and popular adaptive compression techniques. Moreover, our scheme allows random access to data and is not restricted to databases of text.

查看原文本刊更多论文

用于数据库的通用压缩方案

只提供摘要形式。当前的自适应压缩方案(如GZIP和COMPRESS)对于数据库压缩是不切实际的，因为它们不允许随机访问单个记录。通用数据库系统的压缩算法必须解决随机访问和单独解压缩记录的问题，同时保持数据的紧凑存储。neville - manning等人(1994,1996,1997)的SEQUITUR算法也自适应压缩数据，虽然压缩效果很好，但对主存的要求很高。SEQUITUR的初步版本使用半静态建模方法来实现比自适应方法稍差的压缩。我们描述了半静态SEQUITUR算法的一个新变体，RAY，它减少了主内存的使用，并允许随机访问数据库。RAY通过逐步构建具有多次遍历数据的分层语法来对序列中的重复进行建模。RAY的多次传递方法使用有关字符对重复或图频率的统计信息来创建语法中的规则。虽然我们的初步实现不是特别快，但通过限制传递次数，以影响压缩性能为代价，多通道方法可以减少压缩时间。我们发现RAY具有实际的主存需求，并且比高效的霍夫曼方案和流行的自适应压缩技术实现更好的压缩。此外，我们的方案允许随机访问数据，而不局限于文本数据库。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)

自引率

0.00%

发文量