Muhammad Arham Tariq, Allah Bux Sargano, Muhammad Aksam Iftikhar, Z. Habib
{"title":"比较使用机器学习技术预测多类教育数据集的不同过度取样方法","authors":"Muhammad Arham Tariq, Allah Bux Sargano, Muhammad Aksam Iftikhar, Z. Habib","doi":"10.2478/cait-2023-0044","DOIUrl":null,"url":null,"abstract":"Abstract Predicting students’ academic performance is a critical research area, yet imbalanced educational datasets, characterized by unequal academic-level representation, present challenges for classifiers. While prior research has addressed the imbalance in binary-class datasets, this study focuses on multi-class datasets. A comparison of ten resampling methods (SMOTE, Adasyn, Distance SMOTE, BorderLineSMOTE, KmeansSMOTE, SVMSMOTE, LN SMOTE, MWSMOTE, Safe Level SMOTE, and SMOTETomek) is conducted alongside nine classification models: K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Support Vector Machine (SVM), Logistic Regression (LR), Extra Tree (ET), Random Forest (RT), Extreme Gradient Boosting (XGB), and Ada Boost (AdaB). Following a rigorous evaluation, including hyperparameter tuning and 10 fold cross-validations, KNN with SmoteTomek attains the highest accuracy of 83.7%, as demonstrated through an ablation study. These results emphasize SMOTETomek’s effectiveness in mitigating class imbalance in educational datasets and highlight KNN’s potential as an educational data mining classifier.","PeriodicalId":45562,"journal":{"name":"Cybernetics and Information Technologies","volume":null,"pages":null},"PeriodicalIF":1.2000,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparing Different Oversampling Methods in Predicting Multi-Class Educational Datasets Using Machine Learning Techniques\",\"authors\":\"Muhammad Arham Tariq, Allah Bux Sargano, Muhammad Aksam Iftikhar, Z. Habib\",\"doi\":\"10.2478/cait-2023-0044\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract Predicting students’ academic performance is a critical research area, yet imbalanced educational datasets, characterized by unequal academic-level representation, present challenges for classifiers. While prior research has addressed the imbalance in binary-class datasets, this study focuses on multi-class datasets. A comparison of ten resampling methods (SMOTE, Adasyn, Distance SMOTE, BorderLineSMOTE, KmeansSMOTE, SVMSMOTE, LN SMOTE, MWSMOTE, Safe Level SMOTE, and SMOTETomek) is conducted alongside nine classification models: K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Support Vector Machine (SVM), Logistic Regression (LR), Extra Tree (ET), Random Forest (RT), Extreme Gradient Boosting (XGB), and Ada Boost (AdaB). Following a rigorous evaluation, including hyperparameter tuning and 10 fold cross-validations, KNN with SmoteTomek attains the highest accuracy of 83.7%, as demonstrated through an ablation study. These results emphasize SMOTETomek’s effectiveness in mitigating class imbalance in educational datasets and highlight KNN’s potential as an educational data mining classifier.\",\"PeriodicalId\":45562,\"journal\":{\"name\":\"Cybernetics and Information Technologies\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2023-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Cybernetics and Information Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2478/cait-2023-0044\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cybernetics and Information Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2478/cait-2023-0044","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Comparing Different Oversampling Methods in Predicting Multi-Class Educational Datasets Using Machine Learning Techniques
Abstract Predicting students’ academic performance is a critical research area, yet imbalanced educational datasets, characterized by unequal academic-level representation, present challenges for classifiers. While prior research has addressed the imbalance in binary-class datasets, this study focuses on multi-class datasets. A comparison of ten resampling methods (SMOTE, Adasyn, Distance SMOTE, BorderLineSMOTE, KmeansSMOTE, SVMSMOTE, LN SMOTE, MWSMOTE, Safe Level SMOTE, and SMOTETomek) is conducted alongside nine classification models: K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Support Vector Machine (SVM), Logistic Regression (LR), Extra Tree (ET), Random Forest (RT), Extreme Gradient Boosting (XGB), and Ada Boost (AdaB). Following a rigorous evaluation, including hyperparameter tuning and 10 fold cross-validations, KNN with SmoteTomek attains the highest accuracy of 83.7%, as demonstrated through an ablation study. These results emphasize SMOTETomek’s effectiveness in mitigating class imbalance in educational datasets and highlight KNN’s potential as an educational data mining classifier.