ParaDiS: a Parallel and Distributed framework for Significant pattern mining

2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), Pub Date: 2023-05-01, DOI: 10.1109/CCGridW59191.2023.00050

Jyoti, S. Kailasam, A. Buzmakov



Abstract

Mining patterns that are strongly associated with a class label is a supervised data mining technique used in many applications. Because many patterns are tested with statistical tests to find all interesting ones, some associations are likely to arise by chance. The state-of-the-art TopKWY algorithm mines the top-k interesting patterns while controlling the family-wise error rate (FWER) of the result set. TopKWY is a sequential algorithm that internally relies on compute-intensive closed pattern mining. Moreover, it tests several patterns against thousands of permuted class labels to control the FWER. To the best of our knowledge, no parallel or distributed implementation exists to address the scalability challenges faced by TopKWY. The tree formed by the patterns that TopKWY explores is inherently irregular, and its exploration strategy, best-first search, is non-trivial to emulate in a distributed setup. This paper designs and implements ParaDiS, a novel parallel and distributed framework for mining the top-k statistically significant patterns. We compare its performance with the sequential TopKWY algorithm on real-world datasets and observe a significant reduction in execution time. We further show that our framework achieves good speedup, minimal communication overhead, and faster pruning of non-promising branches through efficient sharing of the significance threshold.
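
To make the idea of testing patterns against permuted class labels concrete, here is a minimal sketch of a Westfall-Young-style permutation procedure for choosing a significance threshold that controls the FWER. It is not the TopKWY or ParaDiS implementation: it omits closed pattern mining, best-first search, top-k selection, and any parallelism. The use of Fisher's exact test, the function names (`fisher_p`, `wy_threshold`, `significant_patterns`), and the representation of each pattern as a set of transaction indices are all assumptions made for illustration.

```python
import math
import random


def fisher_p(a, x, n1, n):
    """Two-sided Fisher exact p-value (illustrative choice of test).

    A pattern occurs in x of n transactions; a of those occurrences fall
    in the positive class, which has n1 transactions in total.
    """
    def prob(k):
        # Hypergeometric probability of k positive occurrences.
        return math.comb(n1, k) * math.comb(n - n1, x - k) / math.comb(n, x)

    lo, hi = max(0, x - (n - n1)), min(x, n1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(k) for k in range(lo, hi + 1)
                if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def wy_threshold(occurrences, labels, alpha=0.05, n_perm=1000, seed=0):
    """Estimate a corrected threshold from the minimum p-value per permutation."""
    rng = random.Random(seed)
    n, n1 = len(labels), sum(labels)
    perm = list(labels)
    min_ps = []
    for _ in range(n_perm):
        rng.shuffle(perm)  # break any real association with the class label
        min_ps.append(min(
            fisher_p(sum(perm[i] for i in occ), len(occ), n1, n)
            for occ in occurrences))
    min_ps.sort()
    # At most k permutations have a minimum p-value strictly below min_ps[k],
    # so rejecting only p < min_ps[k] keeps the estimated FWER <= alpha.
    k = math.floor(alpha * n_perm)
    return min_ps[k]


def significant_patterns(occurrences, labels, alpha=0.05, n_perm=1000, seed=0):
    """Return indices of patterns whose p-value is below the corrected threshold."""
    n, n1 = len(labels), sum(labels)
    delta = wy_threshold(occurrences, labels, alpha, n_perm, seed)
    return [j for j, occ in enumerate(occurrences)
            if fisher_p(sum(labels[i] for i in occ), len(occ), n1, n) < delta]


if __name__ == "__main__":
    labels = [1] * 10 + [0] * 10
    occurrences = [
        set(range(10)),         # occurs only in positive transactions
        set(range(0, 20, 2)),   # evenly spread over both classes
        {0, 1, 2, 10, 11, 12},  # evenly spread over both classes
    ]
    print(significant_patterns(occurrences, labels, n_perm=200))
```

Every permutation re-tests every pattern, so the cost grows with the product of the number of patterns and the number of permutations. This is the kind of workload the abstract describes as compute-intensive.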


