Resampling Technique for Imbalanced Class Handling on Educational Dataset

JUITA : Jurnal Informatika Pub Date : 2023-05-06 DOI:10.30595/juita.v11i1.15498

Anief Fauzan Rozi, Adi Wibowo, B. Warsito



Abstract

Educational data mining is an emerging field within data mining. Accurately identifying student accomplishment in a current or upcoming course can help an institution build better technology-aided education. Educational data mining is becoming a more important field of study because of its potential to produce knowledge-base models that can also help teachers and lecturers. Like other classification tasks, educational data mining faces a common, frequently encountered problem: class imbalance. A dataset has imbalanced classes when the classes are not represented in equal proportions. The dataset in this research is severely imbalanced and multiclass, that is, it has more than two class labels. This paper therefore focuses on both imbalanced-class handling and classification, comparing several methods for each: Linear Regression, Random Forest, and Stacking for classification, and SMOTE, ADASYN, and SMOTE-ENN for resampling. The methods are evaluated using 10-fold cross-validation and an 80-20 split. The best performance comes from Stacking classification on the ADASYN-resampled dataset, evaluated with the 80-20 split, with an F1 score of 0.97. The results also show that resampling improves classification performance. Classification without resampling also produced decent results, which may be because the general pattern of the data for each class was already good from the start. Thus, there are no real drawbacks to processing the original data.
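
The abstract names SMOTE as one of the resampling algorithms but does not show how it works. The following is a minimal, standard-library-only sketch of SMOTE-style oversampling for a multiclass dataset, assuming numeric features and Euclidean distance. It is not the authors' implementation: the function name `smote_oversample`, the toy feature values, and the class labels `pass`, `remedial`, and `fail` are all hypothetical and chosen only for illustration. Each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest same-class neighbours.

```python
import math
import random
from collections import Counter


def smote_oversample(X, y, k=5, seed=0):
    """Oversample every non-majority class up to the majority-class count.

    Hypothetical helper: a simplified SMOTE-style sketch, not a library API.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class
    X_out, y_out = list(X), list(y)

    for label, count in counts.items():
        minority = [x for x, lab in zip(X, y) if lab == label]
        if count == target or len(minority) < 2:
            continue  # already at target size, or too few points to interpolate
        for _ in range(target - count):
            base = rng.choice(minority)
            # k nearest same-class neighbours of the base point, excluding itself
            others = [p for p in minority if p is not base]
            others.sort(key=lambda p: math.dist(base, p))
            neighbour = rng.choice(others[:k])
            gap = rng.random()  # position along the segment base -> neighbour
            synthetic = [b + gap * (n - b) for b, n in zip(base, neighbour)]
            X_out.append(synthetic)
            y_out.append(label)
    return X_out, y_out


# Toy three-class dataset with a 6 / 3 / 2 class distribution (made-up values).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2],
     [5.0, 5.0], [5.2, 5.1], [5.1, 4.9],
     [9.0, 0.0], [9.2, 0.3]]
y = ["pass"] * 6 + ["remedial"] * 3 + ["fail"] * 2

X_res, y_res = smote_oversample(X, y, k=2)
print(Counter(y_res))
```

After resampling, each of the three classes in the toy data holds six samples, and the original eleven rows are kept unchanged at the front of the output. In practice, a maintained implementation such as `SMOTE` from the imbalanced-learn package would normally be used instead of a hand-written helper like this one.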
