Enhancing SMOTE for imbalanced data with abnormal minority instances

Surani Matharaarachchi, Mike Domaratzki, Saman Muthukumarana
Machine Learning with Applications, Volume 18, Article 100597. Published 2024-10-29. DOI: 10.1016/j.mlwa.2024.100597

Abstract

Imbalanced datasets are frequent in machine learning, where certain classes are markedly underrepresented compared to others. This imbalance often results in sub-optimal model performance, as classifiers tend to favour the majority class. A significant challenge arises when abnormal instances, such as outliers, exist within the minority class, diminishing the effectiveness of traditional re-sampling methods like the Synthetic Minority Over-sampling Technique (SMOTE). This manuscript addresses this critical issue by introducing four SMOTE extensions: Distance ExtSMOTE, Dirichlet ExtSMOTE, FCRP SMOTE, and BGMM SMOTE. These methods leverage a weighted average of neighbouring instances to enhance the quality of synthetic samples and mitigate the impact of outliers. Comprehensive experiments conducted on diverse simulated and real-world imbalanced datasets demonstrate that the proposed methods improve classification performance compared to the original SMOTE and its most competitive variants. Notably, we demonstrate that Dirichlet ExtSMOTE outperforms most other proposed and existing SMOTE variants in terms of achieving better F1 score, MCC, and PR-AUC. Our results underscore the effectiveness of these advanced SMOTE extensions in tackling class imbalance, particularly in the presence of abnormal instances, offering robust solutions for real-world applications.
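To make the core idea concrete, here is a minimal sketch of generating a synthetic minority sample as a Dirichlet-weighted convex combination of a base point and its k nearest minority neighbours. This is only an illustration of the general weighted-average approach described above, not the paper's exact Dirichlet ExtSMOTE algorithm; the function name, parameters, and neighbour-selection details are assumptions for the example.

```python
import numpy as np

def weighted_smote_sample(X_min, k=5, n_samples=10, alpha=1.0, rng=None):
    """Generate synthetic minority samples as convex combinations of a
    base point and its k nearest minority neighbours, with
    Dirichlet-distributed weights.

    Illustrative sketch only; the paper's method may differ.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    synthetic = []
    for _ in range(n_samples):
        i = rng.integers(n)
        # indices of the k nearest neighbours (position 0 is the point itself)
        nbrs = np.argsort(d[i])[1:k + 1]
        pts = np.vstack([X_min[i], X_min[nbrs]])
        # Dirichlet weights are non-negative and sum to 1, so each sample
        # stays inside the convex hull of the base point and its
        # neighbours; averaging over several neighbours damps the pull
        # of any single outlying point, unlike classic SMOTE's
        # interpolation along a line to one neighbour
        w = rng.dirichlet(np.full(len(pts), alpha))
        synthetic.append(w @ pts)
    return np.array(synthetic)
```

Because every synthetic point is a convex combination, it remains bounded by the coordinate-wise extremes of the minority points used, which is how this family of methods avoids amplifying outliers.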