Statistical analysis of the performance of four Apache Spark ML algorithms

J. Comput. Sci. Technol. Pub Date: 2022-10-17. DOI: 10.24215/16666038.22.e14

Genaro Camele, W. Hasperué, Franco Ronchetti, F. Quiroga


Citations: 0

Abstract



Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the importance of each feature for a particular task. However, due to the increasing size of currently available databases, distributed processing has become a necessity for many tasks. In this context, the Apache Spark ML library is one of the most widely used libraries for performing classification and other tasks with large datasets. Therefore, knowing both the predictive performance and efficiency of its main algorithms before applying an FS technique is crucial to planning computations and saving time. In this work, a comparative study of four Spark ML classification algorithms is carried out, statistically measuring execution times and predictive power based on the number of attributes from a colon cancer database. Results were statistically analyzed, showing that, although Random Forest and Naïve Bayes are the algorithms with the shortest execution times, Support Vector Machine obtains models with the best predictive power. The study of the performance of these algorithms is interesting as they are applied in many different problems, such as classification of pathologies from epigenomic data, image classification, and prediction of computer attacks in network security problems, among others.
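
The kind of measurement the abstract describes (repeated runs of several classifiers at different attribute counts, summarized statistically) can be illustrated with a minimal sketch. This is not the authors' code or experimental protocol: the `benchmark` and `dummy_trainer` names are hypothetical, the trainers are stand-in callables rather than real Spark ML estimators, and the choice of mean and standard deviation as summary statistics is an assumption, since the abstract does not say which statistics or tests were used.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def benchmark(
    trainers: Dict[str, Callable[[int], float]],
    feature_counts: Sequence[int],
    repeats: int = 5,
) -> List[dict]:
    """Time each trainer at each feature count, `repeats` times.

    Each trainer is a hypothetical callable that takes the number of
    features, fits and evaluates a model, and returns a quality score.
    """
    rows = []
    for name, train in trainers.items():
        for n_features in feature_counts:
            times, scores = [], []
            for _ in range(repeats):
                start = time.perf_counter()
                score = train(n_features)
                times.append(time.perf_counter() - start)
                scores.append(score)
            rows.append({
                "algorithm": name,
                "n_features": n_features,
                "mean_time_s": statistics.mean(times),
                # Sample standard deviation needs at least two runs.
                "stdev_time_s": statistics.stdev(times) if repeats > 1 else 0.0,
                "mean_score": statistics.mean(scores),
            })
    return rows


def dummy_trainer(cost_per_feature: float, score: float) -> Callable[[int], float]:
    """Stand-in for a real fit/evaluate step (illustrative only)."""
    def train(n_features: int) -> float:
        # Burn CPU proportional to the feature count to mimic training cost.
        total = 0.0
        for i in range(int(n_features * cost_per_feature)):
            total += i * 0.5
        return score
    return train


if __name__ == "__main__":
    results = benchmark(
        {"fast_model": dummy_trainer(10, 0.80), "slow_model": dummy_trainer(100, 0.90)},
        feature_counts=[100, 1000],
        repeats=3,
    )
    for row in results:
        print(row)
```

In a real experiment each stand-in would be replaced by a function that fits and evaluates one classifier on the first `n_features` attributes of the dataset, and the per-run timings would be kept so that a statistical test could compare algorithms.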

