Comparative Evaluation of Imbalanced Data Management Techniques for Solving Classification Problems on Imbalanced Datasets

Tanawan Watthaisong, K. Sunat, Nipotepat Muangkote
Journal: Statistics, Optimization & Information Computing
Published: 2024-02-18
DOI: https://doi.org/10.19139/soic-2310-5070-1890

Abstract

Handling imbalanced data is both crucial and challenging when developing effective machine-learning classification models. Without proper data management, class imbalance significantly degrades a model's performance and leads to suboptimal results. Many methods for managing imbalanced data have been studied and developed to improve data balance. In this paper, we conduct a comparative study to assess the influence of a ranking technique on the evaluation of the effectiveness of 66 traditional methods for addressing imbalanced data. Three classifiers are used: Decision Tree, Random Forest, and XGBoost. The experiments are divided into two parts. The first evaluates the performance of the various imbalanced-data handling methods, while the second compares the performance of the top four oversampling methods. The study covers 50 datasets: 20 from the UCI repository and 30 from the OpenML repository. The evaluation is based on the F-measure and on statistical methods, including the Kruskal-Wallis test and the Borda count, which are used to rank the imbalance-handling capability of the 66 methods. SMOTE serves as the benchmark for comparison because of its popularity in handling imbalanced data. Based on the experimental results, MCT, Polynom-fit-SMOTE, and CBSO were identified as the top three performers, demonstrating superior effectiveness in managing imbalanced datasets. This research can serve as a practical guide for practitioners in choosing suitable techniques for data management.
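To illustrate how an F-measure evaluation can feed a Borda count ranking, the following is a minimal sketch and not the authors' implementation. The function names, the placeholder method names, the toy scores, and the tie rule (tied methods earn no points from each other) are all assumptions made for illustration; the numbers are not results from the paper.

```python
from __future__ import annotations

from collections import defaultdict


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall for one class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def borda_rank(scores: dict[str, dict[str, float]]) -> list[tuple[str, int]]:
    """Aggregate per-dataset scores into a Borda ranking.

    `scores` maps dataset name -> {method name: F-measure}.
    On each dataset a method earns one point for every method it
    strictly outscores (assumed tie rule: ties earn nothing).
    """
    points: dict[str, int] = defaultdict(int)
    for per_method in scores.values():
        for method, value in per_method.items():
            points[method] += sum(1 for other in per_method.values() if value > other)
    # Highest total first; method name breaks ties deterministically.
    return sorted(points.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical F-measures for three placeholder methods on three datasets.
toy_scores = {
    "dataset_1": {"method_a": 0.80, "method_b": 0.70, "method_c": 0.60},
    "dataset_2": {"method_a": 0.65, "method_b": 0.75, "method_c": 0.60},
    "dataset_3": {"method_a": 0.90, "method_b": 0.85, "method_c": 0.88},
}

ranking = borda_rank(toy_scores)
print(ranking)
```

In this toy input, method_a collects 2 + 1 + 2 = 5 points, method_b 1 + 2 + 0 = 3, and method_c 0 + 0 + 1 = 1. A significance check such as the Kruskal-Wallis test could be run on the same per-dataset scores before ranking, for example with `scipy.stats.kruskal`, although the abstract does not state which software the authors used.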