Performance evaluation of supervised learning algorithms with various training data sizes and missing attributes

Q3 Engineering

Engineering and Applied Science Research Pub Date : 2018-09-14 DOI:10.14456/EASR.2018.28

Chaluemwut Noyunsan, Tatpong Katanyukul, K. Saikaew

Journal volume/issue: 45 1; pages: 221-229

Citations: 5

Abstract

Supervised learning is a machine learning technique used for creating a data prediction model. This article focuses on finding high-performance supervised learning algorithms under varied training data sizes, varied numbers of attributes, and time spent on prediction. This study evaluated seven algorithms, namely Boosting, Random Forest, Bagging, Naive Bayes, K-Nearest Neighbours (K-NN), Decision Tree, and Support Vector Machine (SVM), on seven standard benchmark data sets from the University of California, Irvine (UCI), using two evaluation metrics and experimental settings with various training data sizes and missing key attributes. Our findings reveal that Bagging, Random Forest, and SVM are overall the three most accurate algorithms. However, when the availability of key attribute values is a concern, K-NN is recommended, as its performance is affected the least. Alternatively, when the training data may not be large enough, Naive Bayes is preferable, since it is the algorithm least sensitive to training data size. The algorithms are characterized on a two-dimensional chart based on prediction performance and computation time. This chart is expected to guide a novice user in choosing an appropriate method for his or her needs. Based on this chart, Bagging and Random Forest are in general the two most recommended algorithms because of their high performance and speed.
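The experimental protocol described in the abstract (train on progressively larger subsets, then repeat with a key attribute withheld, while recording prediction time) can be illustrated with a small sketch. This is not the authors' code and does not reproduce their results: the data are synthetic rather than UCI benchmarks, only a hand-written K-NN classifier stands in for the seven algorithms, plain accuracy stands in for the two evaluation metrics, and all function names (`make_data`, `knn_predict`, `drop_attribute`, `run_experiment`) are hypothetical.

```python
import random
import time

def make_data(n, seed=0):
    """Synthetic two-class data (an assumption, not a UCI data set).
    Attribute 0 is the informative 'key' attribute; 1 is weak; 2 is noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x0 = rng.gauss(3.0 * label, 0.5)  # key attribute
        x1 = rng.gauss(0.5 * label, 1.0)  # weakly informative attribute
        x2 = rng.gauss(0.0, 1.0)          # pure noise
        rows.append(([x0, x1, x2], label))
    return rows

def knn_predict(train, x, k=3):
    """Majority vote among the k training rows closest to x (squared Euclidean)."""
    nearest = sorted(
        train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x))
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def evaluate(train, test, k=3):
    """Return (accuracy, prediction time in seconds) on the test rows."""
    start = time.perf_counter()
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    elapsed = time.perf_counter() - start
    return correct / len(test), elapsed

def drop_attribute(rows, idx):
    """Simulate a missing attribute by removing column idx from every row."""
    return [([v for i, v in enumerate(x) if i != idx], y) for x, y in rows]

def run_experiment(sizes=(10, 50, 200), n_test=200, seed=0):
    """Accuracy and prediction time per training size, with and without attribute 0."""
    data = make_data(max(sizes) + n_test, seed)
    test, pool = data[:n_test], data[n_test:]
    results = {}
    for n in sizes:
        train = pool[:n]
        acc_all, t_all = evaluate(train, test)
        acc_missing, t_missing = evaluate(
            drop_attribute(train, 0), drop_attribute(test, 0)
        )
        results[n] = {
            "all_attributes": acc_all,
            "key_attribute_missing": acc_missing,
            "predict_seconds": t_all,
            "predict_seconds_missing": t_missing,
        }
    return results

if __name__ == "__main__":
    for size, row in run_experiment().items():
        print(size, row)
```

Running the same loop over several classifiers and plotting accuracy against prediction time would give a chart of the kind the abstract describes, though the ranking of algorithms on such a toy data set need not match the paper's findings.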




Source journal

Engineering and Applied Science Research Engineering-Engineering (all)

CiteScore: 2.10

Self-citation rate: 0.00%

Articles published:

Review time: 11 weeks

About the journal: Publication of the journal started in 1974. Its original name was “KKU Engineering Journal”. English and Thai manuscripts were accepted. The journal was originally aimed at publishing research that was conducted and implemented in the northeast of Thailand. It is regarded as a national journal and has been indexed in the Thai-journal Citation Index (TCI) database since 2004. The journal now accepts only English-language manuscripts and became open-access in 2015 to attract more international readers. It was renamed Engineering and Applied Science Research in 2017. The editorial team agreed to publish more international papers; the new journal title is therefore more appropriate. The journal focuses on engineering research that not only presents highly original ideas and advanced technology, but also offers practical applications of appropriate technology.