Research on Parallel Data Mining Based on Spark

2022 International Symposium on Advances in Informatics, Electronics and Education (ISAIEE). Pub Date: 2022-12-01. DOI: 10.1109/ISAIEE57420.2022.00033

Jiali Shen


Abstract

In the current era of big data, rapid advances in network technology and hardware have led to exponential data growth. Faced with massive data sets, however, the field of data mining still suffers from several problems: inefficient algorithm execution, insufficient parallel optimization of algorithms, and poor usability of data mining platforms. This paper focuses on parallel data mining algorithms and parallel data mining tools. Using Spark as the programming model and processing engine, it designs and implements a distributed parallel data mining scheduling framework built on Hadoop and Spark that meets users' needs for mining and analyzing large data sets. The scheduling system implements common data mining functions, including classification, prediction, clustering, and data preprocessing, and lets users build data mining models by visually dragging and dropping algorithm components.
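The abstract does not describe how the scheduler orders the components of a drag-and-drop workflow. One common way to do this, shown here purely as an illustrative sketch and not as the paper's method, is to treat the workflow as a directed acyclic graph and run each node after its upstream nodes finish. Every name below (`run_workflow`, `drop_missing`, `scale`, `split_by_threshold`) is hypothetical, and the stages are plain Python placeholders standing in for what would be Spark jobs in a real deployment.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(nodes, edges, data):
    """Run each node once all of its upstream nodes have finished.

    nodes: dict mapping node name -> callable (hypothetical stage)
    edges: list of (upstream, downstream) pairs drawn by the user
    data:  raw input handed to nodes that have no upstream node
    """
    # Map every node to the set of nodes it depends on.
    deps = {name: set() for name in nodes}
    for upstream, downstream in edges:
        deps[downstream].add(upstream)

    results = {}
    order = []
    # static_order() yields nodes so that dependencies always come first.
    for name in TopologicalSorter(deps).static_order():
        upstream = sorted(deps[name])
        # Source nodes receive the raw input; others receive upstream outputs.
        inputs = [results[u] for u in upstream] if upstream else [data]
        results[name] = nodes[name](*inputs)
        order.append(name)
    return order, results


# Placeholder stages (assumptions, not taken from the paper).
def drop_missing(rows):
    return [r for r in rows if r is not None]


def scale(rows):
    top = max(rows)
    return [r / top for r in rows]


def split_by_threshold(rows):
    # Crude stand-in for a clustering step: two groups around 0.5.
    return {
        "low": [r for r in rows if r < 0.5],
        "high": [r for r in rows if r >= 0.5],
    }


nodes = {"clean": drop_missing, "scale": scale, "cluster": split_by_threshold}
edges = [("clean", "scale"), ("scale", "cluster")]
order, results = run_workflow(nodes, edges, [2, None, 8, 4, 10])
```

In this sketch the preprocessing nodes run before the grouping node regardless of the order in which the user placed them on the canvas. A cyclic workflow would make `TopologicalSorter` raise `graphlib.CycleError`, which a scheduler could report back to the user.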


