Scalable Mining of High-Utility Sequential Patterns With Three-Tier MapReduce Model

ACM Transactions on Knowledge Discovery from Data (TKDD). Pub Date: 2021-11-15. DOI: 10.1145/3487046

Chun-Wei Lin, Y. Djenouri, Gautam Srivastava, Yuanfa Li, Philip S. Yu


Citations: 23

Abstract

High-utility sequential pattern mining (HUSPM) has been an active research topic in recent decades because it combines sequential and utility properties, revealing more information than traditional frequent itemset mining or sequential pattern mining. Several HUSPM approaches have been presented, but most of them assume that the data fit in main memory in order to speed up mining. This assumption is unrealistic in large-scale environments: in industry, the collected data are often too large to fit into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns from large-scale databases. Two properties are then established to guarantee the correctness and completeness of the patterns discovered by the framework. In addition, two data structures, called sidset and utility-linked list, are used in the framework to accelerate the computation needed to mine the required patterns. The results show that the designed model performs well on large-scale datasets in terms of runtime, memory usage, efficiency with respect to the number of distributed nodes, and scalability, compared to the serial HUSP-Span approach.
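To make the notion of a "high-utility sequential pattern" concrete, the following is a minimal single-machine sketch of how the utility of one pattern is commonly defined in the HUSPM literature: within a sequence, take the best-scoring occurrence of the pattern, then sum over all sequences. It is not the three-stage MapReduce model, sidset, or utility-linked list described in the abstract; the function names, the toy database, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List

# Assumed representation: a sequence is a list of itemsets, each itemset a
# dict mapping item -> utility (e.g. quantity * unit profit) in that itemset.
QSequence = List[Dict[str, int]]
Pattern = List[FrozenSet[str]]


def pattern_utility(pattern: Pattern, qseq: QSequence) -> int:
    """Maximum utility over all occurrences of `pattern` in `qseq` (0 if none)."""
    neg_inf = float("-inf")
    # best[j] = best utility of matching the first j pattern itemsets so far.
    best = [0] + [neg_inf] * len(pattern)
    for itemset in qseq:
        # Go from high j to low j so one itemset matches at most one element.
        for j in range(len(pattern), 0, -1):
            wanted = pattern[j - 1]
            if best[j - 1] != neg_inf and all(i in itemset for i in wanted):
                gain = sum(itemset[i] for i in wanted)
                best[j] = max(best[j], best[j - 1] + gain)
    return int(best[-1]) if best[-1] != neg_inf else 0


def database_utility(pattern: Pattern, database: List[QSequence]) -> int:
    """Total utility of `pattern`: sum of its per-sequence utilities."""
    return sum(pattern_utility(pattern, s) for s in database)


# Hypothetical toy database with two sequences.
toy_db: List[QSequence] = [
    [{"a": 2, "b": 3}, {"c": 4}, {"a": 5}],
    [{"a": 1}, {"c": 6, "b": 2}],
]
pat_a: Pattern = [frozenset({"a"})]
pat_ac: Pattern = [frozenset({"a"}), frozenset({"c"})]

min_utility = 10  # hypothetical threshold
is_high_a = database_utility(pat_a, toy_db) >= min_utility
is_high_ac = database_utility(pat_ac, toy_db) >= min_utility
```

In this toy database the pattern <a> has total utility 6 while its extension <a, c> has total utility 13, so the longer pattern passes the threshold and the shorter one does not. Utility is therefore not anti-monotone, which is why HUSPM algorithms cannot rely on frequency-style pruning alone and need additional bounds and supporting data structures.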
