Somayah Karsoum, Clark Barrus, L. Gruenwald, Eleazar Leal
Title: Mining timed sequential patterns: The Minits-AllOcc technique
Journal: 自主智能(英文) (Autonomous Intelligence, English edition)
Published: 2023-06-19
DOI: https://doi.org/10.32629/jai.v6i1.593
Abstract
Sequential pattern mining is a data mining task that finds the subsequences in a sequence dataset that occur together in time order. Sequence data can be collected from devices such as sensors, GPS receivers, or satellites, and is ordered by timestamp, that is, the time at which each record was generated or collected. Mining patterns in such data supports many applications, including transportation recommendation systems, transportation safety, weather forecasting, and disease symptom analysis. Numerous techniques have been proposed for mining subsequences in a sequence dataset; however, traditional algorithms ignore the temporal information between the itemsets of a sequential pattern. This information is essential in many situations. Knowing that measurement Y occurs after measurement X is valuable, but it is more valuable to know the estimated time that elapses before measurement Y appears, for example, to schedule maintenance at the right time to prevent railway damage. Taking temporal relationships into account raises new problems, such as designing a data structure that stores this information and traversing that structure efficiently to discover patterns without re-scanning the database. In this paper, we propose an algorithm called Minits-AllOcc (MINIng Timed Sequential Pattern for All-time Occurrences) that finds sequential patterns and the transition times between their itemsets based on all occurrences of a pattern in the database. We also propose a parallel multi-core CPU version of this algorithm, called MMinits-AllOcc (Multi-core for MINIng Timed Sequential Pattern for All-time Occurrences), to deal with Big Data. Extensive experiments on real and synthetic datasets show the advantages of this approach over the brute-force method. In addition, the multi-core CPU version of the algorithm is shown to outperform the single-core version on Big Data by a factor of 2.5.
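To make the idea of a timed sequential pattern concrete, the following is a minimal brute-force sketch in Python. It is not the Minits-AllOcc algorithm, whose data structure and traversal are not described in the abstract. It only illustrates one plausible reading of "transition time based on all occurrences": every in-order match of a pattern is enumerated, and the time gaps between consecutive itemsets are averaged. The function names, the data layout (a sequence as a list of `(timestamp, itemset)` pairs), and the use of the mean as the time estimate are all assumptions made for illustration.

```python
from statistics import mean


def occurrences(sequence, pattern):
    """Yield the timestamps of every in-order match of `pattern` in one sequence.

    `sequence` is a list of (timestamp, itemset) pairs sorted by timestamp;
    `pattern` is a list of itemsets (Python sets). Hypothetical layout.
    """
    def extend(start, k, stamps):
        if k == len(pattern):
            yield tuple(stamps)
            return
        for i in range(start, len(sequence)):
            t, itemset = sequence[i]
            if pattern[k] <= itemset:  # pattern itemset is contained in this one
                yield from extend(i + 1, k + 1, stamps + [t])

    yield from extend(0, 0, [])


def timed_pattern(database, pattern):
    """Return (support, mean transition time per consecutive itemset pair)."""
    support = 0
    gaps = [[] for _ in range(len(pattern) - 1)]
    for sequence in database:
        found = False
        for stamps in occurrences(sequence, pattern):
            found = True
            for j in range(len(gaps)):
                gaps[j].append(stamps[j + 1] - stamps[j])
        support += found  # count each sequence at most once
    return support, [mean(g) for g in gaps if g]


# Toy database: three sequences of (timestamp, itemset) pairs.
database = [
    [(0, {"X"}), (5, {"Y"}), (9, {"X"}), (12, {"Y"})],
    [(2, {"X", "Z"}), (10, {"Y"})],
    [(1, {"Z"}), (4, {"Y"})],
]
support, transition_times = timed_pattern(database, [{"X"}, {"Y"}])
print(support, transition_times)
```

In this toy database the pattern X followed by Y is matched in two of the three sequences, and the gaps from all four matches are averaged into a single estimated transition time. The sketch re-scans every sequence for each pattern, which is the kind of cost the abstract says the proposed technique is designed to avoid.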