An Analysis of The Small Sample Datasets Based on Machine Learning
Shaoxuan Zhou
Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering
Published: 2022-10-21
DOI: https://doi.org/10.1145/3573428.3573720
Citation count: 0
Abstract
Building an optimal model for small sample data has become a widespread problem in machine learning and the data science community. Some methods have been shown to achieve high accuracy when trained on small sample datasets, but more extreme small-sample problems remain underexplored. This paper therefore examines the prediction accuracy of machine learning methods on small sample datasets. Using the forest fire dataset and the pulsar dataset from Kaggle as examples, predictions were made with several machine learning models (SVM, random forest, neural networks, regression). The models failed to achieve high prediction accuracy on imbalanced samples, as represented by the forest fire dataset. Because the cases are few and their distribution is imbalanced, the models cannot establish a clear degree of discrimination for each feature. In summary, prediction on small sample datasets requires better model-building methods and more cases at the data collection stage; otherwise, machine learning offers little help in practical situations.
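The abstract does not say which evaluation metrics were used, so the following is only a minimal sketch, under stated assumptions, of why a small, imbalanced sample is hard to evaluate. The labels, the 18-to-2 class split, and the always-predict-the-majority baseline are all hypothetical and are not taken from the paper or from the Kaggle datasets. The sketch compares plain accuracy with per-class recall and balanced accuracy (the mean of the per-class recalls).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label present in y_true."""
    totals = Counter(y_true)
    # Count correct predictions per true label; missing labels count as 0.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical toy sample: 18 majority cases (0) and 2 minority cases (1).
y_true = [0] * 18 + [1] * 2
# Hypothetical baseline that always predicts the majority class.
y_pred = [0] * 20

print("accuracy:", accuracy(y_true, y_pred))
print("per-class recall:", per_class_recall(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

With only two minority cases in this toy sample, the baseline never identifies one of them even though most of its predictions are correct. This is one way to see how few minority cases give a model little to learn from, and why reporting per-class figures alongside overall accuracy can be informative.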