Closing the Gap: Sequence Mining at Scale

IF 2.2 2区 计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS
Kaustubh Beedkar, K. Berberich, Rainer Gemulla, Iris Miliaraki
{"title":"Closing the Gap: Sequence Mining at Scale","authors":"Kaustubh Beedkar, K. Berberich, Rainer Gemulla, Iris Miliaraki","doi":"10.1145/2757217","DOIUrl":null,"url":null,"abstract":"Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"10 1","pages":"8:1-8:44"},"PeriodicalIF":2.2000,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Database Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/2757217","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 11

Abstract

Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
缩小差距:大规模的序列挖掘
频繁序列挖掘是数据挖掘的基本组成部分之一。虽然这个问题已经得到了广泛的研究,但很少有可用的技术能够充分扩展到处理具有数十亿序列的数据集;例如,这种大规模数据集出现在文本挖掘和会话分析中。在本文中,我们提出了MG-FSM,一种在MapReduce上进行频繁序列挖掘的可扩展算法。MG-FSM可以处理所谓的“间隙约束”,它可以用来限制输出到一组受控的频繁序列。支持位置和时间间隙约束,以及适当的最大值和封闭性约束。在其核心,MG-FSM以一种允许我们使用任何现有的频繁序列挖掘算法独立挖掘每个分区的方式对输入数据库进行分区。我们引入了ω-等价的概念,这是许多频繁模式挖掘算法所使用的“投影数据库”概念的推广。我们还介绍了一些优化技术,这些技术可以最小化分区大小,从而减少计算和通信成本,同时仍然保持正确性。我们在文本挖掘和会话分析背景下的实验研究表明,MG-FSM比其他方法更有效和可扩展。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
ACM Transactions on Database Systems
ACM Transactions on Database Systems 工程技术-计算机:软件工程
CiteScore
5.60
自引率
0.00%
发文量
15
审稿时长
>12 weeks
期刊介绍: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信