注意差距:大规模频繁的序列挖掘

Proceedings. ACM-SIGMOD International Conference on Management of Data Pub Date : 2013-06-22 DOI:10.1145/2463676.2465285

Iris Miliaraki, K. Berberich, Rainer Gemulla, Spyros Zoupanos

{"title":"注意差距:大规模频繁的序列挖掘","authors":"Iris Miliaraki, K. Berberich, Rainer Gemulla, Spyros Zoupanos","doi":"10.1145/2463676.2465285","DOIUrl":null,"url":null,"abstract":"Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called \"gap constraints\", which can be used to limit the output to a controlled set of frequent sequences. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a \"projected database\" used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the context of text mining suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"423 1","pages":"797-808"},"PeriodicalIF":0.0000,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"68","resultStr":"{\"title\":\"Mind the gap: large-scale frequent sequence mining\",\"authors\":\"Iris Miliaraki, K. Berberich, Rainer Gemulla, Spyros Zoupanos\",\"doi\":\"10.1145/2463676.2465285\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called \\\"gap constraints\\\", which can be used to limit the output to a controlled set of frequent sequences. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a \\\"projected database\\\" used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the context of text mining suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.\",\"PeriodicalId\":87344,\"journal\":{\"name\":\"Proceedings. ACM-SIGMOD International Conference on Management of Data\",\"volume\":\"423 1\",\"pages\":\"797-808\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-06-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"68\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. ACM-SIGMOD International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2463676.2465285\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. ACM-SIGMOD International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2463676.2465285","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 68

摘要

频繁序列挖掘是数据挖掘的基本组成部分之一。虽然这个问题已经得到了广泛的研究，但很少有可用的技术能够充分扩展到处理具有数十亿序列的数据集;例如，这种大规模数据集出现在文本挖掘和会话分析中。本文提出了一种用于MapReduce频繁序列挖掘的可扩展算法MG-FSM。MG-FSM可以处理所谓的“间隙约束”，它可以用来限制输出到一组受控的频繁序列。在其核心，MG-FSM以一种允许我们使用任何现有的频繁序列挖掘算法独立挖掘每个分区的方式对输入数据库进行分区。我们引入了w-等价的概念，这是许多频繁模式挖掘算法所使用的“投影数据库”概念的推广。我们还介绍了一些优化技术，这些技术可以最小化分区大小，从而减少计算和通信成本，同时仍然保持正确性。我们在文本挖掘上下文中的实验研究表明，MG-FSM比其他方法更有效和可扩展。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Mind the gap: large-scale frequent sequence mining

Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called "gap constraints", which can be used to limit the output to a controlled set of frequent sequences. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a "projected database" used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the context of text mining suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings. ACM-SIGMOD International Conference on Management of Data

自引率

0.00%

发文量