Research on Parallel Data Mining Based on Spark

Jiali Shen
Published in: 2022 International Symposium on Advances in Informatics, Electronics and Education (ISAIEE), 2022-12-01
DOI: 10.1109/ISAIEE57420.2022.00033 (https://doi.org/10.1109/ISAIEE57420.2022.00033)
Citations: 0

Abstract

In the current era of big data, rapid advances in network technology and hardware have led to exponential data growth. Under the pressure of massive data, however, data mining still faces several problems: inefficient algorithm execution, insufficient parallel optimization of algorithms, and poor usability of data mining platforms. This paper focuses on parallel data mining algorithms and parallel data mining tools. Using Spark as the programming model and processing engine, it designs and implements a distributed parallel data mining scheduling framework built on Hadoop and Spark that meets users' needs for mining and analyzing large data sets. The scheduling system implements common data mining algorithms for classification, prediction, clustering, and data preprocessing, and lets users complete data mining modeling by visually dragging and dropping algorithm components.