A parallel and balanced SVM algorithm on Spark for data-intensive computing

IF 0.8 | Zone 4 (Computer Science) | JCR Q4: Computer Science, Artificial Intelligence

Intelligent Data Analysis | Pub Date: 2023-06-01 | DOI: 10.3233/ida-226774

Jianjiang Li, Jinliang Shi, Zhiguo Liu, Can Feng

Pages: 1065-1086

Citations: 0

Abstract

Support Vector Machine (SVM) is a machine learning method with excellent classification performance that has been widely used in fields such as data mining, text classification, and face recognition. However, once the data volume reaches a certain scale, computation time becomes too long and efficiency drops. To address this issue, we propose a parallel balanced SVM algorithm based on Spark, named PB-SVM, which optimizes the traditional Cascade SVM algorithm. PB-SVM consists of three parts, Clustering Equal Division, Balancing Shuffle, and Iteration Termination, which together address the data skew of Cascade SVM and the large difference between local and global support vectors. We implement PB-SVM on an AliCloud Spark distributed cluster and test it on five public datasets. Our experimental results show that, in the binary classification test on the covtype dataset, PB-SVM improves efficiency by 38.9% and 75.4% and accuracy by 7.16% and 8.38% compared with MLlib-SVM and Cascade SVM on Spark, respectively. Moreover, in the multi-class classification test on covtype, PB-SVM improves efficiency and accuracy by 94.8% and 18.26%, respectively, compared with Cascade SVM on Spark.
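The abstract names a Clustering Equal Division step but does not specify it. As a rough illustration of the data-skew problem it targets, the Python sketch below deals samples into partitions cluster by cluster, so that every partition has nearly the same size and a similar mix of clusters. The function name `equal_division`, its parameters, and the round-robin strategy are assumptions made for illustration only; they are not the authors' procedure, and the cluster labels are assumed to come from some earlier clustering pass.

```python
from collections import defaultdict


def equal_division(cluster_ids, num_partitions):
    """Deal sample indices into partitions, cluster by cluster.

    Hypothetical helper (not taken from the paper): partition sizes
    differ by at most one, and each cluster is spread across the
    partitions as evenly as possible.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    # Group sample indices by the cluster each sample belongs to.
    by_cluster = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(index)

    partitions = [[] for _ in range(num_partitions)]
    slot = 0  # one running counter across all clusters keeps sizes even
    for cluster in sorted(by_cluster):
        for index in by_cluster[cluster]:
            partitions[slot % num_partitions].append(index)
            slot += 1
    return partitions


# Toy input: cluster 0 is twice as large as clusters 1 and 2.
cluster_ids = [0] * 6 + [1] * 3 + [2] * 3
parts = equal_division(cluster_ids, num_partitions=3)
sizes = [len(p) for p in parts]  # [4, 4, 4]
```

In a cascade-style setup, each partition would then train a local SVM on a sample that resembles the full dataset, which is one plausible way to keep local support vectors closer to the global ones.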


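Source journal

Intelligent Data Analysis (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 2.20
Self-citation rate: 5.90%
Publication count: not shown
Review time: 3.3 months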

Journal description: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are preferred that discuss the development of new AI-related data analysis architectures, methodologies, and techniques and their applications to various domains.