Cost-Sensitive Ensemble Learning for Highly Imbalanced Classification

2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA). Pub Date: 2022-12-01. DOI: 10.1109/ICMLA55696.2022.00225

Justin M. Johnson, T. Khoshgoftaar


Citations: 0

Abstract

There are a variety of data-level and algorithm-level methods available for treating class imbalance. Data-level methods include data sampling strategies that pre-process training data to reduce levels of class imbalance. Algorithm-level methods modify the learning and inference processes to reduce bias towards the majority class. This study evaluates both data-level and algorithm-level methods for class imbalance using a highly imbalanced healthcare fraud data set. We approach the problem from a cost-sensitive learning perspective, and demonstrate how these direct and indirect cost-sensitive methods can be implemented using a common cost matrix. For each method, a wide range of costs are evaluated using three popular ensemble learning algorithms. Initial results show that random undersampling (RUS) and class weighting are both effective ways to improve classification when the default classification threshold is used. Further analysis using the area under the precision-recall curve, however, shows that both RUS and class weighting actually decrease the discriminative power of these learners. Through multiple complementary performance metrics and confidence interval analysis, we find that the best model performance is consistently obtained when RUS and class weighting are not applied, but when output thresholding is used to maximize the confusion matrix instead. Our contributions include various recommendations related to implementing cost-sensitive ensemble learning and effective model evaluation, as well as empirical evidence that contradicts popular beliefs about learning from imbalanced data.
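To make the idea of a common cost matrix concrete, here is a minimal sketch of how one pair of misclassification costs could drive all three treatments named in the abstract: class weighting, random undersampling, and output thresholding. It uses the standard cost-sensitive relationships (threshold = C_FP / (C_FP + C_FN), and a majority-class keep rate of C_FP / C_FN to emulate that threshold at the default 0.5). This is an assumption for illustration and not necessarily the procedure used in the paper. All function names are hypothetical, and correct predictions are assumed to cost zero.

```python
import random


def threshold_from_costs(cost_fp, cost_fn):
    """Cost-minimizing decision threshold on the positive-class probability.

    Assumes zero cost for correct predictions (an illustrative simplification).
    """
    return cost_fp / (cost_fp + cost_fn)


def class_weights_from_costs(cost_fp, cost_fn):
    """Map the same costs to per-class weights: errors on the positive
    class (label 1) are weighted by the false-negative cost."""
    return {0: cost_fp, 1: cost_fn}


def random_undersample(labels, cost_fp, cost_fn, seed=0):
    """Return indices kept by random undersampling of the negative class.

    Every positive is kept; each negative is kept with probability
    cost_fp / cost_fn (capped at 1), which shifts the effective
    threshold toward the cost-based one when 0.5 is used at inference.
    """
    rng = random.Random(seed)
    keep_rate = min(1.0, cost_fp / cost_fn)
    return [i for i, y in enumerate(labels)
            if y == 1 or rng.random() < keep_rate]


def predict(scores, threshold):
    """Output thresholding: label 1 when the score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


if __name__ == "__main__":
    c_fp, c_fn = 1, 9  # hypothetical costs: a missed fraud case costs 9x a false alarm
    print(threshold_from_costs(c_fp, c_fn))
    print(class_weights_from_costs(c_fp, c_fn))
    labels = [1] * 5 + [0] * 1000
    kept = random_undersample(labels, c_fp, c_fn)
    print(len(kept), "examples kept out of", len(labels))
```

Note that the first two treatments change what the learner is trained on, whereas thresholding leaves the trained model untouched and only moves the decision point. That distinction is consistent with the abstract's finding that RUS and class weighting can reduce the area under the precision-recall curve while thresholding does not alter the ranking of scores.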

