Enhancing SMOTE for imbalanced data with abnormal minority instances

Surani Matharaarachchi, Mike Domaratzki, Saman Muthukumarana
Machine Learning with Applications, Volume 18, Article 100597. Published 2024-10-29. DOI: 10.1016/j.mlwa.2024.100597

Abstract

Imbalanced datasets are frequent in machine learning, where certain classes are markedly underrepresented compared to others. This imbalance often results in sub-optimal model performance, as classifiers tend to favour the majority class. A significant challenge arises when abnormal instances, such as outliers, exist within the minority class, diminishing the effectiveness of traditional re-sampling methods like the Synthetic Minority Over-sampling Technique (SMOTE). This manuscript addresses this critical issue by introducing four SMOTE extensions: Distance ExtSMOTE, Dirichlet ExtSMOTE, FCRP SMOTE, and BGMM SMOTE. These methods leverage a weighted average of neighbouring instances to enhance the quality of synthetic samples and mitigate the impact of outliers. Comprehensive experiments conducted on diverse simulated and real-world imbalanced datasets demonstrate that the proposed methods improve classification performance compared to the original SMOTE and its most competitive variants. Notably, we demonstrate that Dirichlet ExtSMOTE outperforms most other proposed and existing SMOTE variants in terms of achieving better F1 score, MCC, and PR-AUC. Our results underscore the effectiveness of these advanced SMOTE extensions in tackling class imbalance, particularly in the presence of abnormal instances, offering robust solutions for real-world applications.
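To make the core idea concrete, here is a minimal sketch of generating a synthetic minority sample as a Dirichlet-weighted convex combination of a base point and its k nearest minority neighbours. This is only an illustration of the general weighted-average approach described above, not the paper's exact Dirichlet ExtSMOTE algorithm; the function name, parameters, and neighbour-selection details are assumptions for the example.

```python
import numpy as np

def weighted_smote_sample(X_min, k=5, n_samples=10, alpha=1.0, rng=None):
    """Generate synthetic minority samples as convex combinations of a
    base point and its k nearest minority neighbours, with
    Dirichlet-distributed weights.

    Illustrative sketch only; the paper's method may differ.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    synthetic = []
    for _ in range(n_samples):
        i = rng.integers(n)
        # indices of the k nearest neighbours (position 0 is the point itself)
        nbrs = np.argsort(d[i])[1:k + 1]
        pts = np.vstack([X_min[i], X_min[nbrs]])
        # Dirichlet weights are non-negative and sum to 1, so each sample
        # stays inside the convex hull of the base point and its
        # neighbours; averaging over several neighbours damps the pull
        # of any single outlying point, unlike classic SMOTE's
        # interpolation along a line to one neighbour
        w = rng.dirichlet(np.full(len(pts), alpha))
        synthetic.append(w @ pts)
    return np.array(synthetic)
```

Because every synthetic point is a convex combination, it remains bounded by the coordinate-wise extremes of the minority points used, which is how this family of methods avoids amplifying outliers.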