Comparison of Resampling Techniques for Imbalanced Datasets in Student Dropout Prediction
Authors: Sheikh Masood, S. Begum
Published in: 2022 IEEE Silchar Subsection Conference (SILCON)
Publication date: 2022-11-04
DOI: 10.1109/SILCON55242.2022.10028915 (https://doi.org/10.1109/SILCON55242.2022.10028915)
Citation count: 1
Abstract
Imbalanced data is one of the main challenges in Student Dropout Prediction (SDP): it reduces the effectiveness of Machine Learning (ML) classifiers at identifying students who drop out. Class imbalance arises when samples are distributed disproportionately between the majority class (more samples) and the minority class (fewer samples), and it is a significant challenge in classification problems generally. When a dataset is highly imbalanced, ML classifiers report high accuracy because they learn mostly from the majority class, so accuracy alone may not give correct insight into the trained model. This paper presents the findings of a study of several resampling techniques for handling imbalanced data at the data preprocessing level. Two ML algorithms, Logistic Regression and Support Vector Machine (SVM), are used to predict the minority class and are evaluated with several performance metrics for binary classification. The Area Under Curve (AUC) score is found to give the most reliable result among the metrics considered for predicting the minority class, i.e., the students who drop out.
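The point about accuracy being misleading on imbalanced data can be illustrated with a minimal sketch. It is not the authors' experimental setup: the 95/5 class split, the always-majority "classifier", and the helper names `random_oversample`, `accuracy`, and `auc` are all assumptions made for illustration, and no Logistic Regression or SVM model is trained here. Random oversampling stands in for the resampling techniques the abstract mentions without naming.

```python
import random


def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority samples until both classes have equal size."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    # Draw (with replacement) as many extra minority samples as needed to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    return [x for x, _ in combined], [t for _, t in combined]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of samples, so only
    suitable for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (invented): 95 retained students (0) and 5 dropouts (1).
y = [0] * 95 + [1] * 5
X = list(range(100))  # placeholder features, unused by the degenerate classifier

# A degenerate "classifier" that always predicts the majority class.
always_majority_pred = [0] * len(y)
always_majority_score = [0.0] * len(y)

print("accuracy:", accuracy(y, always_majority_pred))
print("AUC:", auc(y, always_majority_score))

# Oversampling at the preprocessing level balances the training labels.
X_bal, y_bal = random_oversample(X, y)
print("balanced class counts:", y_bal.count(0), y_bal.count(1))
```

In this toy case the always-majority predictor scores high on accuracy while its AUC shows it cannot separate the two classes at all, which is the kind of gap the abstract describes. In practice resampling would be applied to the training split only, so that the evaluation data keeps its original class distribution.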