Closing the Gap: Sequence Mining at Scale

IF 2.2 · Zone 2 (Computer Science) · JCR Q3 (Computer Science, Information Systems)

ACM Transactions on Database Systems · Pub Date: 2015-06-30 · Pages: 8:1-8:44 · DOI: 10.1145/2757217

Kaustubh Beedkar, K. Berberich, Rainer Gemulla, Iris Miliaraki

Citations: 11

Abstract

Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
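To make the notion of a positional gap constraint concrete, the following is a minimal single-machine sketch in Python. It is not MG-FSM and does not reproduce its partitioning or ω-equivalency machinery; it only illustrates, by brute force, what it means for a pattern to be frequent under a gap constraint. The parameter names `max_gap`, `min_support`, and `max_len`, and the function names, are assumptions introduced here for illustration and do not come from the article.

```python
from collections import Counter


def occurs_with_gap(pattern, seq, max_gap):
    """Return True if `pattern` is a subsequence of `seq` such that at most
    `max_gap` items of `seq` are skipped between consecutive matched items.
    (Hypothetical helper; `max_gap` is an assumed name for the gap limit.)"""
    if not pattern:
        return True
    # Positions in `seq` where a match of the current pattern prefix can end.
    ends = {i for i, item in enumerate(seq) if item == pattern[0]}
    for item in pattern[1:]:
        ends = {
            j
            for i in ends
            for j in range(i + 1, min(i + max_gap + 2, len(seq)))
            if seq[j] == item
        }
        if not ends:
            return False
    return bool(ends)


def support(pattern, database, max_gap):
    """Number of input sequences that contain `pattern` under the gap limit."""
    return sum(occurs_with_gap(pattern, seq, max_gap) for seq in database)


def mine_naive(database, min_support, max_gap, max_len):
    """Brute-force gap-constrained mining: enumerate every gap-respecting
    subsequence of length <= max_len in each input sequence, count each
    pattern once per sequence, and keep those meeting `min_support`."""
    counts = Counter()
    for seq in database:
        found = set()
        # Each frontier entry is (pattern so far, index of its last item).
        frontier = [((item,), i) for i, item in enumerate(seq)]
        while frontier:
            pattern, last = frontier.pop()
            found.add(pattern)
            if len(pattern) < max_len:
                stop = min(last + max_gap + 2, len(seq))
                for j in range(last + 1, stop):
                    frontier.append((pattern + (seq[j],), j))
        counts.update(found)
    return {p: c for p, c in counts.items() if c >= min_support}
```

With a small database such as `[("a","b","c"), ("a","x","c"), ("a","x","y","c")]`, the pattern `("a","c")` is matched by two sequences when one item may be skipped and by all three when two may be skipped, which shows how tightening the gap limit shrinks the output. The enumeration here grows quickly with `max_gap` and `max_len`, which is the kind of cost a scalable, partition-based approach is meant to avoid.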

Source journal

ACM Transactions on Database Systems (Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 5.60

Self-citation rate: 0.00%

Articles published: (not shown)

Review time: >12 weeks

Journal description: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.