Improve Data Mining Techniques with a High-Performance Cluster

2022 International Conference on Computer and Applications (ICCA). Pub Date: 2022-12-20. DOI: 10.1109/ICCA56443.2022.10039629

H. Fadhil, Zainab Abdulnasser, S. Mohammed



Abstract

People's reliance on computers and the computing power they provide is growing by the minute. An ever-increasing amount of data is created each day, and analyzing it requires cluster computers to process the data. Data clustering has been found to be a beneficial data mining approach, and there have been a number of recent attempts to cluster data mining methods. Using a Raspberry Pi cluster, this study employs the Apriori algorithm, the most widely used algorithm, to extract frequent itemsets from large data sets. The fundamental aim is to build a cluster and provide data analysis capabilities, based on an examination of the major clustering phases, in order to illustrate the power of cluster computing and the applications of data analytics. The Raspberry Pi nodes use the MPI standard and Python multiprocessing to share a large task, then coordinate their findings across a group of four or more MPICH systems once processing is complete. Load balancing must be taken into account at the data partitioning stage. According to our test results, clustering accelerates sequential classification by a factor of 10. Performance increases noticeably as more processors are added. Additionally, we found that item count had a bigger effect on clustering performance than transaction count.
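The partition-and-merge idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: every function name here (`split_evenly`, `count_partition`, `next_candidates`, `apriori`) is hypothetical, the round-robin split is only one simple way to balance load, and the parallel step is left pluggable through a `map_fn` argument rather than tied to MPI or MPICH.

```python
from collections import Counter
from functools import partial
from itertools import combinations


def split_evenly(transactions, n_parts):
    """Round-robin split, so partition sizes differ by at most one."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def count_partition(candidates, partition):
    """Count how many transactions in one partition contain each candidate."""
    counts = Counter()
    for transaction in partition:
        items = set(transaction)
        for candidate in candidates:
            if candidate <= items:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def next_candidates(frequent, k):
    """Build k-itemsets whose every (k-1)-subset is frequent (Apriori property)."""
    items = sorted({item for itemset in frequent for item in itemset})
    return [
        frozenset(combo)
        for combo in combinations(items, k)
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1))
    ]


def apriori(transactions, min_support, n_parts=4, map_fn=map):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    partitions = split_evenly(transactions, n_parts)
    all_items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in all_items]
    result, k = {}, 1
    while candidates:
        # Map step: each partition is counted independently (the parallel part).
        partials = map_fn(partial(count_partition, candidates), partitions)
        # Reduce step: merge the per-partition counts into global counts.
        totals = sum(partials, Counter())
        frequent = {c: n for c, n in totals.items() if n >= min_support}
        result.update(frequent)
        k += 1
        candidates = next_candidates(set(frequent), k)
    return result
```

With the default `map_fn=map` the sketch runs sequentially. Passing the `map` method of a `multiprocessing.Pool` (created under an `if __name__ == "__main__":` guard) would spread the partitions across worker processes, and the same map and reduce steps could in principle be expressed as an MPI scatter and gather. Note that the inner loop in `count_partition` runs once per candidate per transaction, and the candidate list can grow combinatorially with the number of distinct items, which is at least consistent with the abstract's remark that item count matters more than transaction count.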

