PRO-SMOTEBoost: An adaptive SMOTEBoost probabilistic algorithm for rebalancing and improving imbalanced data classification

Impact Factor: 8.1 · CAS Tier 1 (Computer Science) · COMPUTER SCIENCE, INFORMATION SYSTEMS
Laouni Djafri
Information Sciences, Volume 690, Article 121548. DOI: 10.1016/j.ins.2024.121548. Published online: 2024-10-18.
Full text: https://www.sciencedirect.com/science/article/pii/S0020025524014622
Citations: 0

Abstract

In the field of data mining and machine learning, dealing with imbalanced datasets is one of the most complex problems. The class imbalance issue significantly affects the classification of minority classes when using common classification algorithms. These algorithms often prioritize improving the performance of the majority class at the expense of the minority class, leading to misclassifying negative instances as positive ones. To address this problem, the Synthetic Minority Over-sampling Technique (SMOTE) has gained popularity as a way to rebalance imbalanced data for classification. In this paper, we propose two algorithms to further enhance the performance of imbalanced classification. The first algorithm is PRO-SMOTE, an improvement over SMOTE. PRO-SMOTE relies on conditional probabilities to effectively rebalance imbalanced classes and to improve the predictive performance metrics reliably. By considering conditional probabilities, PRO-SMOTE can reduce the majority classes and optimally increase the minority class. The second, PRO-SMOTEBoost, builds on PRO-SMOTE to overcome classification anomalies and problems encountered by machine learning algorithms, especially weak learners. PRO-SMOTEBoost aims to maximize predictive precision by combining the strengths of PRO-SMOTE with boosting techniques. Evaluating these algorithms with traditional machine learning algorithms such as Random Forests, C4.5, Naive Bayes, and Support Vector Machines has demonstrated excellent classification results. The performance metrics achieved by the proposed algorithm, encompassing F1-score, G-means, Precision, Accuracy, Recall, AUC-ROC, and Precision-Recall curves, range from over 90% to 100%. Compared to using these traditional algorithms individually, PRO-SMOTEBoost yields a significant improvement of 10% to 40% in performance metrics. Overall, the proposed algorithms, PRO-SMOTE and PRO-SMOTEBoost, offer effective solutions to the challenges posed by imbalanced datasets. They provide improved predictive metrics and demonstrate their superiority when compared to both traditional and modern machine learning algorithms.
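The paper does not include code for PRO-SMOTE, but the classic SMOTE interpolation step it builds on can be sketched as follows. This is a minimal NumPy version for illustration only; `smote_oversample` and its parameters are hypothetical names, not the authors' API, and the probabilistic weighting that distinguishes PRO-SMOTE is omitted:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Classic SMOTE: create n_new synthetic minority samples by
    interpolating between a randomly picked minority sample and one
    of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbour
    k = min(k, len(X_min) - 1)
    nn = np.argsort(d, axis=1)[:, :k]    # k nearest neighbours per sample
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))     # pick a minority sample
        nb = X_min[rng.choice(nn[j])]    # pick one of its neighbours
        gap = rng.random()               # interpolation factor in [0, 1]
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic
```

In a SMOTEBoost-style combination, an oversampling step like this is rerun before each boosting round so that every weak learner trains on a freshly rebalanced sample; PRO-SMOTEBoost, per the abstract, replaces the uniform random choices above with conditional-probability-guided ones.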
Source journal: Information Sciences (Engineering & Technology – Computer Science: Information Systems)
CiteScore: 14.00 · Self-citation rate: 17.30% · Articles per year: 1322 · Average review time: 10.4 months
Aims and scope: Informatics and Computer Science Intelligent Systems Applications is an esteemed international journal that focuses on publishing original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and surveying contributions. Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up-to-date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.