Stream Sequential Pattern Mining with Precise Error Bounds

2008 Eighth IEEE International Conference on Data Mining. Pub Date: 2008-12-15. DOI: 10.1109/ICDM.2008.154

L. F. Mendes, Bolin Ding, Jiawei Han


Citations: 66

Abstract

Sequential pattern mining is an interesting data mining problem with many real-world applications. This problem has been studied extensively in static databases. However, in recent years, emerging applications have introduced a new form of data called a data stream. In a data stream, new elements are generated continuously. This poses additional constraints on the methods used for mining such data: memory usage is restricted, the infinitely flowing original dataset cannot be scanned multiple times, and current results should be available on demand. This paper introduces two effective methods for mining sequential patterns from data streams: the SS-BE method and the SS-MB method. The proposed methods break the stream into batches and process each batch only once. The two methods use different pruning strategies that restrict memory usage but can still guarantee that all true sequential patterns are output at the end of any batch. Both algorithms scale linearly in execution time as the number of sequences grows, making them effective methods for sequential pattern mining in data streams. The experimental results also show that our methods are very accurate in that only a small fraction of the patterns that are output are false positives. Even for these false positives, SS-BE guarantees that their true support is above a pre-defined threshold.
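
The abstract does not spell out the SS-BE or SS-MB algorithms, so the following is only an illustrative sketch of the general idea it describes: process the stream one batch at a time, prune rarely seen patterns to bound memory, and still never miss a truly frequent pattern. It swaps in a Lossy Counting style pruning rule, which is an assumption and not the authors' method. The names `BatchedPatternCounter`, `patterns_in`, `epsilon`, `sigma`, and `max_len` are hypothetical, and patterns are limited to short ordered subsequences to keep the example small.

```python
from collections import Counter
from itertools import combinations
import math


def patterns_in(sequence, max_len):
    """Return the distinct ordered subsequences of `sequence` with 1..max_len items."""
    found = set()
    for k in range(1, max_len + 1):
        # combinations() preserves the original item order, so each tuple
        # is a (not necessarily contiguous) subsequence.
        found.update(combinations(sequence, k))
    return found


class BatchedPatternCounter:
    """Approximate support counts over a stream, one pass per batch.

    Assumed pruning rule (Lossy Counting style, not taken from the paper):
    each stored count undershoots the true support by at most epsilon * seen.
    """

    def __init__(self, epsilon, max_len=2):
        self.epsilon = epsilon
        self.max_len = max_len
        self.seen = 0    # number of sequences processed so far
        self.table = {}  # pattern -> [count, max_missed]

    def add_batch(self, batch):
        # A pattern absent from the table may have been pruned earlier;
        # it can have occurred at most this many times before this batch.
        max_missed = math.floor(self.epsilon * self.seen)
        local = Counter()
        for seq in batch:
            local.update(patterns_in(seq, self.max_len))  # once per sequence
        for pattern, count in local.items():
            if pattern in self.table:
                self.table[pattern][0] += count
            else:
                self.table[pattern] = [count, max_missed]
        self.seen += len(batch)
        # Prune patterns whose support cannot exceed epsilon * seen.
        limit = self.epsilon * self.seen
        self.table = {p: e for p, e in self.table.items() if e[0] + e[1] > limit}

    def frequent(self, sigma):
        """Patterns whose stored count is at least (sigma - epsilon) * seen."""
        cutoff = (sigma - self.epsilon) * self.seen
        return {p: e[0] for p, e in self.table.items() if e[0] >= cutoff}


if __name__ == "__main__":
    counter = BatchedPatternCounter(epsilon=0.25, max_len=2)
    counter.add_batch([("a", "b", "c"), ("a", "c"), ("b", "d"), ("a", "b")])
    counter.add_batch([("a", "c"), ("a", "b", "c"), ("e",), ("a", "c")])
    print(counter.frequent(sigma=0.5))
```

Under this rule, every pattern whose true support is at least `sigma * seen` is reported, and any false positive still has true support of at least `(sigma - epsilon) * seen`. In the usage above, `("a", "b")` is such a false positive: it occurs in 3 of 8 sequences, below the 4 required by `sigma = 0.5` but above the cutoff of 2.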


