The Effect of Class Imbalance Handling on Datasets Toward Classification Algorithm Performance

MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer | Pub Date: 2023-03-01 | DOI: 10.30812/matrik.v22i2.2515

Cherfly Kaope, Yoga Pristyanto


Citations: 1

Abstract

Class imbalance is a condition in which the minority class contains fewer samples than the majority class. An imbalanced dataset leads to misclassification of the minority class, which degrades classification performance. Approaches to the problem include data-level methods, algorithm-level methods, and cost-sensitive learning. At the data level, one common method is resampling. In this study, the ADASYN, SMOTE, and SMOTE-ENN sampling methods were combined with the AdaBoost, K-Nearest Neighbor, and Random Forest classification algorithms to handle class imbalance. The purpose of the study was to determine the effect of class imbalance handling on classification performance. Tests were carried out on five datasets, and the models were evaluated on accuracy, precision, true positive rate, true negative rate, and g-mean score. Based on the classification results, the combination of ADASYN and Random Forest performed better than the other model schemes, with results 5% to 10% better than those of the other models.
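The five evaluation criteria named in the abstract can be sketched for a binary task as follows. This is a minimal illustration, not the authors' code: the function name `imbalance_metrics`, the toy labels, and the choice of label 1 as the minority (positive) class are assumptions made here, and the g-mean is taken to be the usual geometric mean of the true positive rate and the true negative rate, which the abstract does not spell out.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, TPR, TNR and g-mean for binary labels.

    Hypothetical helper for illustration; `positive` marks the minority class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Guard against empty denominators (e.g. no positive predictions).
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)  # true positive rate (recall on the minority class)
    tnr = ratio(tn, tn + fp)  # true negative rate (recall on the majority class)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "tpr": tpr,
        "tnr": tnr,
        "g_mean": math.sqrt(tpr * tnr),  # assumed definition: sqrt(TPR * TNR)
    }


# Toy data (made up): 2 minority samples and 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # one minority sample is missed
scores = imbalance_metrics(y_true, y_pred)

# A classifier that always predicts the majority class.
majority_only = imbalance_metrics(y_true, [0] * len(y_true))
```

On this toy data, accuracy stays high even when minority samples are missed, while the g-mean falls to zero for the majority-only predictor. That contrast is one reason studies of class imbalance report the true positive rate, true negative rate, and g-mean alongside accuracy.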

