The wavelet trie: maintaining an indexed sequence of strings in compressed space

Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

Pub Date: 2012-04-16

DOI: 10.1145/2213556.2213586

R. Grossi, G. Ottaviano

Pages: 203-214

Citations: 30

Abstract

An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory.

We introduce and study the problem of compressed indexed sequence of strings, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations.

We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
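To make the combination of a Patricia trie and a wavelet tree concrete, here is a minimal static sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the names `WaveletTrie`, `Node`, `encode`, `decode`, `access`, `rank` and `rank_prefix` are chosen for this example, the per-node bitvectors are plain Python lists with prefix sums rather than compressed succinct bitvectors, and the dynamic operations (insertion and deletion) are not covered. Each string is mapped to a bit string with a terminator so that no encoded string is a prefix of another.

```python
from os.path import commonprefix  # character-wise longest common prefix


def encode(s, terminate=True):
    """Text -> bit string: '1' + 8 bits per byte; a final '0' makes the set prefix-free."""
    bits = "".join("1" + format(b, "08b") for b in s.encode("utf-8"))
    return bits + "0" if terminate else bits


def decode(bits):
    """Inverse of encode() for terminated bit strings."""
    data = bytes(int(bits[i + 1:i + 9], 2) for i in range(0, len(bits) - 1, 9))
    return data.decode("utf-8")


class Node:
    """One node: a Patricia-style label `alpha` plus a wavelet-tree-style bitvector `beta`."""

    def __init__(self, seqs):
        self.alpha = commonprefix(seqs)          # label shared by every string here
        rest = [s[len(self.alpha):] for s in seqs]
        self.beta = None                         # None marks a leaf (one distinct string)
        if rest[0]:                              # prefix-free input: non-empty => internal
            self.beta = [int(r[0]) for r in rest]
            self.ones = [0]                      # ones[i] = number of 1 bits in beta[:i]
            for b in self.beta:
                self.ones.append(self.ones[-1] + b)
            # Strings are routed to child 0 or 1 by their branching bit, keeping order.
            self.children = [Node([r[1:] for r in rest if r[0] == c]) for c in "01"]

    def rank_bit(self, b, pos):
        """Number of bits equal to b in beta[:pos]."""
        return self.ones[pos] if b else pos - self.ones[pos]


class WaveletTrie:
    """Static sketch; `strings` must be a non-empty list."""

    def __init__(self, strings):
        self.n = len(strings)
        self.root = Node([encode(s) for s in strings])

    def access(self, pos):
        """Return the string stored at position pos."""
        node, out = self.root, []
        while True:
            out.append(node.alpha)
            if node.beta is None:
                return decode("".join(out))
            b = node.beta[pos]
            out.append(str(b))
            pos = node.rank_bit(b, pos)          # position inside the chosen child
            node = node.children[b]

    def _count_bits(self, bits, pos):
        """Count strings among the first pos whose encoding starts with bits."""
        node = self.root
        while True:
            if node.alpha.startswith(bits):      # query ends inside this label
                return pos
            if node.beta is None or not bits.startswith(node.alpha):
                return 0                         # mismatch: nothing below can match
            b = int(bits[len(node.alpha)])
            bits = bits[len(node.alpha) + 1:]
            pos = node.rank_bit(b, pos)
            node = node.children[b]

    def rank(self, s, pos):
        """Occurrences of exactly s among positions [0, pos)."""
        return self._count_bits(encode(s), pos)

    def rank_prefix(self, p, pos):
        """Number of strings with prefix p among positions [0, pos)."""
        return self._count_bits(encode(p, terminate=False), pos)
```

Range counting over positions `[l, r)` follows from two rank calls, for example `wt.rank_prefix(p, r) - wt.rank_prefix(p, l)`. An exact-match count reuses the prefix routine because the terminator bit ensures that only the string itself can extend its own terminated encoding.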

