Performance evaluation of supervised learning algorithms with various training data sizes and missing attributes
Chaluemwut Noyunsan, Tatpong Katanyukul, K. Saikaew
Engineering and Applied Science Research, 45(1), pp. 221-229, published 14 September 2018. DOI: 10.14456/EASR.2018.28
Cited by: 5
Abstract
Supervised learning is a machine learning technique used to create a data prediction model. This article focuses on finding high-performance supervised learning algorithms under varied training data sizes, varied numbers of attributes, and time spent on prediction. This study evaluated seven algorithms, Boosting, Random Forest, Bagging, Naive Bayes, K-Nearest Neighbours (K-NN), Decision Tree, and Support Vector Machine (SVM), on seven standard benchmark data sets from the University of California, Irvine (UCI), using two evaluation metrics and experimental settings with various training data sizes and missing key attributes. Our findings reveal that Bagging, Random Forest, and SVM are overall the three most accurate algorithms. However, when the presence of key attribute values is a concern, K-NN is recommended because its performance is affected the least. Alternatively, when the training data may not be large enough, Naive Bayes is preferable, since it is the algorithm least sensitive to training data size. The algorithms are characterized on a two-dimensional chart based on prediction performance and computation time. This chart is intended to guide a novice user in choosing an appropriate method for his/her needs. Based on this chart, in general, Bagging and Random Forest are the two most recommended algorithms because of their high performance and speed.
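The abstract describes comparing seven classifiers while varying the training data size. The sketch below illustrates one way such a comparison could be set up with scikit-learn; it is not the paper's actual protocol. The dataset (Iris, standing in for the UCI benchmarks), the default hyperparameters, the training fractions, and the use of accuracy as the sole metric are all assumptions made here for illustration.

```python
# A minimal sketch of the kind of comparison described in the abstract.
# Assumptions: Iris stands in for the paper's UCI data sets, default
# hyperparameters are used, and accuracy is the only metric reported.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

classifiers = {
    "Boosting": AdaBoostClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Bagging": BaggingClassifier(),
    "Naive Bayes": GaussianNB(),
    "K-NN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
}

# Vary the training data size, mirroring the experimental setting above.
for train_fraction in (0.2, 0.5, 0.8):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_fraction, random_state=0, stratify=y)
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"train={train_fraction:.0%}  {name:<14} accuracy={acc:.3f}")
```

A full reproduction would additionally simulate missing key attributes and record prediction time per algorithm, as the study does, but the loop structure would stay the same.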
Journal introduction:
Publication of the journal started in 1974. Its original name was "KKU Engineering Journal", and both English and Thai manuscripts were accepted. The journal was originally aimed at publishing research conducted and implemented in the northeast of Thailand. It is regarded as a national journal and has been indexed in the Thai-Journal Citation Index (TCI) database since 2004. The journal now accepts only English-language manuscripts and became open access in 2015 to attract more international readers. It was renamed Engineering and Applied Science Research in 2017; because the editorial team agreed to publish more international papers, the new title is more appropriate. The journal focuses on engineering research that not only presents highly original ideas and advanced technology, but also offers practical applications of appropriate technology.