{"title":"Compressing relations and indexes","authors":"J. Goldstein, R. Ramakrishnan, U. Shaft","doi":"10.1109/ICDE.1998.655800","DOIUrl":null,"url":null,"abstract":"We propose a new compression algorithm that is tailored to database applications. It can be applied to a collection of records, and is especially effective for records with many low to medium cardinality fields and numeric fields. In addition, this new technique supports very fast decompression. Promising application domains include decision support systems (DSS), since fact tables, which are by far the largest tables in these applications, contain many low and medium cardinality fields and typically no text fields. Further, our decompression rates are faster than typical disk throughputs for sequential scans; in contrast, gzip is slower. This is important in DSS applications, which often scan large ranges of records. An important distinguishing characteristic of our algorithm, in contrast to compression algorithms proposed earlier, is that we can decompress individual tuples (even individual fields), rather than a full page (or an entire relation) at a time. Also, all the information needed for tuple decompression resides on the same page with the tuple. This means that a page can be stored in the buffer pool and used in compressed form, simplifying the job of the buffer manager and improving memory utilization. Our compression algorithm also improves index structures such as B-trees and R-trees significantly by reducing the number of leaf pages and compressing index entries, which greatly increases the fan-out. We can also use lossy compression on the internal nodes of an index.","PeriodicalId":264926,"journal":{"name":"Proceedings 14th International Conference on Data Engineering","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1998-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"229","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings 14th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.1998.655800","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 229
Abstract
We propose a new compression algorithm that is tailored to database applications. It can be applied to a collection of records, and is especially effective for records with many low to medium cardinality fields and numeric fields. In addition, this new technique supports very fast decompression. Promising application domains include decision support systems (DSS), since fact tables, which are by far the largest tables in these applications, contain many low and medium cardinality fields and typically no text fields. Further, our decompression rates are faster than typical disk throughputs for sequential scans; in contrast, gzip is slower. This is important in DSS applications, which often scan large ranges of records. An important distinguishing characteristic of our algorithm, in contrast to compression algorithms proposed earlier, is that we can decompress individual tuples (even individual fields), rather than a full page (or an entire relation) at a time. Also, all the information needed for tuple decompression resides on the same page with the tuple. This means that a page can be stored in the buffer pool and used in compressed form, simplifying the job of the buffer manager and improving memory utilization. Our compression algorithm also improves index structures such as B-trees and R-trees significantly by reducing the number of leaf pages and compressing index entries, which greatly increases the fan-out. We can also use lossy compression on the internal nodes of an index.
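The abstract's two central claims (any single tuple or field can be decoded without touching the rest of the page, and all decoding state lives in the page itself) map onto what is now commonly called frame-of-reference compression: each page stores a base value and a fixed bit width per column, and values are packed as small offsets from that base. The sketch below illustrates the idea for one numeric column of one page; it is a minimal illustration in the spirit of the abstract, not the paper's implementation, and all function names and the bit-packing layout are assumptions.

```python
# Minimal sketch of page-level frame-of-reference compression for one
# numeric column. Illustrative only; names and layout are not from the paper.

def compress_page(values):
    """Compress one page's worth of values from a numeric column.

    The 'frame' (the column minimum) and the per-value bit width are kept
    in the page header, so any individual value can be decoded using only
    information on the same page -- the property the abstract emphasizes.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo).bit_length() or 1   # bits needed per offset, at least 1
    packed = 0
    for i, v in enumerate(values):
        packed |= (v - lo) << (i * width)  # fixed-width slot per value
    header = (lo, width, len(values))
    return header, packed

def decompress_value(header, packed, i):
    """Decode only the i-th value: per-tuple (here, per-field) access,
    with no need to decompress the surrounding page."""
    lo, width, n = header
    assert 0 <= i < n
    mask = (1 << width) - 1
    return lo + ((packed >> (i * width)) & mask)

# Usage: a low-cardinality column compresses to a few bits per value.
header, packed = compress_page([1998, 1998, 1999, 2001, 2000])
assert decompress_value(header, packed, 3) == 2001
```

Because the offsets have a fixed width, the i-th value sits at a known bit position, which is what makes random access within a compressed page cheap; the same fixed-width encoding also shrinks index entries, which is consistent with the abstract's point about increasing B-tree and R-tree fan-out.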