Order, subset construction and sequential pattern mining

IF 6.8; Zone 1, Computer Science; COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Sciences, Pub Date: 2025-05-27, DOI: 10.1016/j.ins.2025.122348

Slimane Oulad-Naoui, Hadda Cherroun, Djelloul Ziadi

Volume 717, Article 122348. Not open access. Full text: https://www.sciencedirect.com/science/article/pii/S0020025525004803

Citations: 0

Abstract

Sequential Pattern Mining (SPM) is a basic task in data mining. It aims to extract the most frequently occurring sequences in a dataset, which proves instrumental in many fields. In [1] we initiated an attempt to formally unify leading pattern mining approaches. This paper builds upon that work, first by extending the polynomial model to SPM. Next, we devise an efficient implementation, termed WASMA, that enhances the standard subset construction method. To do so, we first partition the set of states into independent sets based on their labels, and then define three different state orderings. The first is a global id-based order, which we use in global exploration. The second is local and is used in itemset extension. Finally, a geometric ordering is exploited to avoid redundant computations. To handle the memory bottleneck of determinization, we propose two variants, WASMA-wsc and WASMA-ssc, which differ in whether they rely on the state existence check clause. Unlike existing approaches, which overlook the appearance of repetitive computation paths, the first variant introduces a novel feature: it avoids recomputing previously explored sub-branches of the problem space. In addition, we refine the well-known theoretical upper bound for the SPM setting by establishing new complexity bounds as a function of the geometric-order topology. Evaluations demonstrate that our solution outperforms existing approaches on SPM instances with very low support thresholds, and is the only one to return a result where its competitors hit the time limit.
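
For readers unfamiliar with the standard subset construction that the abstract says WASMA enhances, the following is a minimal sketch of the textbook method only, not the authors' algorithm. It assumes an NFA without epsilon moves, given as a dictionary from (state, symbol) to a set of successors. The names `subset_construction`, `toy_nfa` and `seen` are hypothetical and chosen for illustration. The `seen` set plays the role of a state existence check in this sketch: a subset that has already been created is never explored a second time.

```python
from collections import deque


def subset_construction(nfa, start, alphabet):
    """Determinize an NFA (no epsilon moves) by the textbook subset construction.

    nfa: dict mapping (state, symbol) -> set of successor states.
    Returns (dfa_transitions, dfa_start); DFA states are frozensets of NFA states.
    """
    dfa_start = frozenset([start])
    seen = {dfa_start}           # existence check: subsets already created
    queue = deque([dfa_start])   # subsets whose outgoing transitions are pending
    dfa = {}
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of the successors of every NFA state in the current subset.
            target = frozenset(
                t for s in current for t in nfa.get((s, symbol), ())
            )
            if not target:
                continue         # no transition on this symbol
            dfa[(current, symbol)] = target
            if target not in seen:
                # Explore a subset only the first time it appears.
                seen.add(target)
                queue.append(target)
    return dfa, dfa_start


# Toy NFA over {a, b}; illustrative data, not taken from the paper.
toy_nfa = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
}
dfa, dfa_start = subset_construction(toy_nfa, 0, "ab")
dfa_states = {dfa_start} | set(dfa.values())
```

Dropping the `seen` check would still terminate on acyclic inputs but could revisit the same subset along several paths, which is the kind of repeated work the abstract attributes to approaches without such a check; keeping it trades memory for time, since every created subset must be stored.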


Source journal

Information Sciences (Engineering and Technology; Computer Science: Information Systems)

CiteScore

14.00

Self-citation rate

17.30%

Articles published

1322

Review time

10.4 months

About the journal: Informatics and Computer Science Intelligent Systems Applications is an esteemed international journal that focuses on publishing original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and survey contributions. Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up to date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.