Genaro Camele, W. Hasperué, Franco Ronchetti, F. Quiroga
{"title":"统计分析四种Apache Spark ML算法的性能","authors":"Genaro Camele, W. Hasperué, Franco Ronchetti, F. Quiroga","doi":"10.24215/16666038.22.e14","DOIUrl":null,"url":null,"abstract":"Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess theimportance of each feature for a particular task. However, due to the increasing size of currently availabledatabases, distributed processing has become a necessity for many tasks. In this context, the Apache SparkML library is one of the most widely used libraries for performing classification and other tasks with largedatasets. Therefore, knowing both the predictive performance and efficiency of its main algorithms beforeapplying a FS technique is crucial to planning computations and saving time. In this work, a comparativestudy of four Spark ML classification algorithms is carried out, statistically measuring execution times andpredictive power based on the number of attributes from a colon cancer database. Results were statistically analyzed, showing that, although Random Forest and Na¨ıve Bayes are the algorithms with the shortest execution times, Support Vector Machine obtains models with the best predictive power. The study of the performance of these algorithms is interesting as they are applied in many different problems, such as classification of pathologies from epigenomic data, image classification, prediction of computer attacks in network security problems, among others.","PeriodicalId":188846,"journal":{"name":"J. Comput. Sci. Technol.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Statistical analysis of the performance of four Apache Spark ML algorithms\",\"authors\":\"Genaro Camele, W. Hasperué, Franco Ronchetti, F. Quiroga\",\"doi\":\"10.24215/16666038.22.e14\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess theimportance of each feature for a particular task. However, due to the increasing size of currently availabledatabases, distributed processing has become a necessity for many tasks. In this context, the Apache SparkML library is one of the most widely used libraries for performing classification and other tasks with largedatasets. Therefore, knowing both the predictive performance and efficiency of its main algorithms beforeapplying a FS technique is crucial to planning computations and saving time. In this work, a comparativestudy of four Spark ML classification algorithms is carried out, statistically measuring execution times andpredictive power based on the number of attributes from a colon cancer database. Results were statistically analyzed, showing that, although Random Forest and Na¨ıve Bayes are the algorithms with the shortest execution times, Support Vector Machine obtains models with the best predictive power. The study of the performance of these algorithms is interesting as they are applied in many different problems, such as classification of pathologies from epigenomic data, image classification, prediction of computer attacks in network security problems, among others.\",\"PeriodicalId\":188846,\"journal\":{\"name\":\"J. Comput. Sci. Technol.\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"J. Comput. Sci. Technol.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.24215/16666038.22.e14\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"J. Comput. Sci. Technol.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.24215/16666038.22.e14","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Statistical analysis of the performance of four Apache Spark ML algorithms
Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess theimportance of each feature for a particular task. However, due to the increasing size of currently availabledatabases, distributed processing has become a necessity for many tasks. In this context, the Apache SparkML library is one of the most widely used libraries for performing classification and other tasks with largedatasets. Therefore, knowing both the predictive performance and efficiency of its main algorithms beforeapplying a FS technique is crucial to planning computations and saving time. In this work, a comparativestudy of four Spark ML classification algorithms is carried out, statistically measuring execution times andpredictive power based on the number of attributes from a colon cancer database. Results were statistically analyzed, showing that, although Random Forest and Na¨ıve Bayes are the algorithms with the shortest execution times, Support Vector Machine obtains models with the best predictive power. The study of the performance of these algorithms is interesting as they are applied in many different problems, such as classification of pathologies from epigenomic data, image classification, prediction of computer attacks in network security problems, among others.