Statistical analysis of the performance of four Apache Spark ML algorithms

Genaro Camele, W. Hasperué, Franco Ronchetti, F. Quiroga
J. Comput. Sci. Technol., published 2022-10-17. DOI: 10.24215/16666038.22.e14 (https://doi.org/10.24215/16666038.22.e14)

Abstract

Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the importance of each feature for a particular task. However, due to the increasing size of currently available databases, distributed processing has become a necessity for many tasks. In this context, the Apache Spark ML library is one of the most widely used libraries for performing classification and other tasks with large datasets. Therefore, knowing both the predictive performance and the efficiency of its main algorithms before applying an FS technique is crucial to planning computations and saving time. In this work, a comparative study of four Spark ML classification algorithms is carried out, statistically measuring execution times and predictive power as a function of the number of attributes taken from a colon cancer database. Results were statistically analyzed, showing that, although Random Forest and Naïve Bayes are the algorithms with the shortest execution times, Support Vector Machine obtains models with the best predictive power. The performance of these algorithms is of interest because they are applied to many different problems, such as the classification of pathologies from epigenomic data, image classification, and the prediction of computer attacks in network security.
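The measurement procedure the abstract describes, timing repeated training runs as the number of attributes grows, could be sketched roughly as follows. This is a minimal, library-agnostic illustration and not the authors' code: `benchmark`, `train_fns`, and the stand-in workloads are hypothetical names introduced here. In a Spark ML setting, each callable would presumably wrap fitting one classifier on a DataFrame restricted to the first `n_features` attributes.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def benchmark(
    train_fns: Dict[str, Callable[[int], object]],
    feature_counts: Sequence[int],
    repeats: int = 5,
) -> List[dict]:
    """Time every training callable at every feature count.

    Each callable receives the number of features to use and is expected
    to train one model (hypothetical contract, assumed for this sketch).
    """
    rows = []
    for name, train in train_fns.items():
        for n_features in feature_counts:
            times = []
            for _ in range(repeats):
                start = time.perf_counter()
                train(n_features)
                times.append(time.perf_counter() - start)
            rows.append(
                {
                    "algorithm": name,
                    "n_features": n_features,
                    "mean_s": statistics.mean(times),
                    # The sample standard deviation needs at least two runs.
                    "stdev_s": statistics.stdev(times) if len(times) > 1 else 0.0,
                }
            )
    return rows


if __name__ == "__main__":
    # Stand-in workloads so the sketch runs without a Spark cluster.
    demo_fns = {
        "cheap": lambda n: sum(range(n)),
        "costly": lambda n: sorted(range(n * 10, 0, -1)),
    }
    for row in benchmark(demo_fns, feature_counts=[100, 1000], repeats=3):
        print(row)
```

Collecting several timings per configuration, rather than a single run, is what makes a later statistical comparison between algorithms possible; the choice of test is left open here because the abstract does not name one.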