Space Efficient String Mining under Frequency Constraints

2008 Eighth IEEE International Conference on Data Mining. Pub Date: 2008-12-15. DOI: 10.1109/ICDM.2008.32

J. Fischer, V. Mäkinen, Niko Välimäki


Citations: 29

Abstract

Let D1 and D2 be two databases (i.e., multisets) of d strings over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, for example patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in optimal linear time with the aid of suffix trees or suffix arrays. This stands in stark contrast to other pattern domains such as item-sets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| ≪ n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement increases from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
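To make the problem statement concrete, the following is a minimal brute-force sketch of one variant: reporting substrings that are frequent in D1 but infrequent in D2. It is not the paper's algorithm and has none of its time or space guarantees; it enumerates substrings explicitly instead of using suffix trees, suffix arrays, or compressed indexes. The function names, the `max_len` cap, and the choice to measure frequency as the fraction of strings in a database that contain the pattern are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def substrings(s: str, max_len: int) -> Set[str]:
    """Return all distinct non-empty substrings of s with at most max_len characters."""
    return {
        s[i:j]
        for i in range(len(s))
        for j in range(i + 1, min(len(s), i + max_len) + 1)
    }


def document_frequency(db: List[str], max_len: int) -> Dict[str, int]:
    """For each substring, count how many strings of db contain it at least once."""
    counts: Dict[str, int] = {}
    for s in db:
        # A set per string, so repeated occurrences inside one string count once.
        for p in substrings(s, max_len):
            counts[p] = counts.get(p, 0) + 1
    return counts


def mine_discriminative(
    d1: List[str],
    d2: List[str],
    min_freq_d1: float,
    max_freq_d2: float,
    max_len: int = 8,
) -> List[str]:
    """Hypothetical helper: patterns with relative frequency >= min_freq_d1 in d1
    and <= max_freq_d2 in d2 (assumed frequency definition, see lead-in)."""
    df1 = document_frequency(d1, max_len)
    df2 = document_frequency(d2, max_len)
    result = []
    for p, c1 in df1.items():
        f1 = c1 / len(d1)
        f2 = df2.get(p, 0) / len(d2)
        if f1 >= min_freq_d1 and f2 <= max_freq_d2:
            result.append(p)
    return sorted(result)


if __name__ == "__main__":
    d1 = ["acgt", "acga", "tacg"]
    d2 = ["ttga", "gatt", "cgtt"]
    # Patterns present in every string of d1 and in no string of d2.
    print(mine_discriminative(d1, d2, min_freq_d1=1.0, max_freq_d2=0.0))
    # prints ['ac', 'acg']
```

The sketch materializes every candidate substring in a dictionary, so its memory use grows far beyond the size of the input; this is exactly the kind of overhead that index-based approaches, and the space-efficient method described in the abstract, are designed to avoid.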

