Zips: mining compressing sequential patterns in streams

Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics Pub Date : 2013-08-11 DOI:10.1145/2501511.2501520

Hoang Thanh Lam, T. Calders, Jie Yang, F. Mörchen, Dmitriy Fradkin

Citations: 5

Abstract

We propose a streaming algorithm, based on the minimum description length (MDL) principle, for extracting non-redundant sequential patterns. For static databases, the MDL-based approach, which selects patterns by their capacity to compress the data rather than by their frequency, has been shown to be remarkably effective at extracting meaningful patterns and at resolving the redundancy issue in frequent itemset and sequence mining. The existing MDL-based algorithms, however, either start from a seed set of frequent patterns or require multiple passes through the data. As such, they scale poorly and are unsuitable for large datasets. Our main contribution is therefore a new streaming algorithm, called Zips, that does not require a seed set of patterns and needs only one scan over the data. For Zips, we extend the Lempel-Ziv (LZ) compression algorithm in three ways. First, whereas LZ assigns codes uniformly as it builds up its dictionary while scanning the input, Zips assigns codewords according to the usage of the dictionary words: more heavily used words get shorter code lengths. Second, Zips also exploits non-consecutive occurrences of dictionary words for compression. Third, the well-known space-saving algorithm is used to evict unpromising words from the dictionary. Experiments on one synthetic and two real-world large-scale datasets show that our approach extracts meaningful compressing patterns of similar quality to those found by the state-of-the-art multi-pass algorithms proposed for static sequence databases. Moreover, our approach scales linearly with the size of the data stream, whereas none of the existing algorithms do.
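As an illustration of two of the ideas named in the abstract (usage-based code lengths and space-saving eviction), the following is a minimal sketch and not the authors' implementation. The abstract does not specify the encoding or the eviction criterion, so the class name `SpaceSavingDictionary`, the Shannon-style code length `-log2(usage / total)`, and the use of raw usage counts as the eviction key are all assumptions made for this example.

```python
import math


class SpaceSavingDictionary:
    """Hypothetical bounded dictionary with approximate usage counts.

    Follows the generic Space-Saving scheme; the details used in Zips
    itself are not given in the abstract and are assumed here.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.usage = {}  # word -> approximate usage count

    def record(self, word):
        """Count one use of `word`, evicting the least-used word if full."""
        if word in self.usage:
            self.usage[word] += 1
        elif len(self.usage) < self.capacity:
            self.usage[word] = 1
        else:
            # Space-Saving step: drop the minimum-count entry and let the
            # newcomer inherit that count plus one.
            victim = min(self.usage, key=self.usage.get)
            inherited = self.usage.pop(victim)
            self.usage[word] = inherited + 1

    def code_length(self, word):
        """Assumed code length in bits: -log2(usage / total usage)."""
        total = sum(self.usage.values())
        return -math.log2(self.usage[word] / total)


if __name__ == "__main__":
    d = SpaceSavingDictionary(capacity=2)
    for w in ["ab", "ab", "ab", "cd"]:
        d.record(w)
    # "ab" accounts for 3 of 4 uses, so it gets the shorter code.
    print(d.code_length("ab") < d.code_length("cd"))
    d.record("ef")  # dictionary is full: the least-used word "cd" is evicted
    print(sorted(d.usage))
```

In this sketch the eviction scan is linear in the dictionary size for simplicity; a practical Space-Saving implementation would typically keep the counters in a structure that exposes the minimum in constant time.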
