{"title":"Robust Thresholding Strategies for Highly Imbalanced and Noisy Data","authors":"Justin M. Johnson, T. Khoshgoftaar","doi":"10.1109/ICMLA52953.2021.00192","DOIUrl":null,"url":null,"abstract":"Many studies have shown that non-default decision thresholds are required to maximize classification performance on highly imbalanced data sets. Thresholding strategies include using a threshold equal to the prior probability of the positive class or identifying an optimal threshold on training data. It is not clear, however, how these thresholding strategies will generalize to imbalanced data sets that contain class label noise. When class noise is present, the positive class prior is influenced by the class label noise, and a threshold that is optimized on noisy training data may not generalize to test data. We employ four thresholding strategies: two thresholds that are optimized on training data and two thresholds that depend on the positive class prior. Threshold strategies are evaluated on a range of noise levels and noise distributions using the Random Forest, Multilayer Perceptron, and XGBoost learners. While all four thresholding strategies significantly outperform the default threshold with respect to the Geometric Mean (G-Mean), three of the four thresholds yield unstable true positive rates (TPR) and true negative rates (TNR) in the presence of class noise. Results show that setting the threshold equal to the prior probability of the noisy positive class consistently performs best according to G-Mean, TPR, and TNR. This is the first evaluation of thresholding strategies for imbalanced and noisy data, to the best of our knowledge, and our results contradict related works that have suggested optimizing thresholds on training data as the best approach.","PeriodicalId":6750,"journal":{"name":"2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)","volume":"17 1","pages":"1182-1188"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA52953.2021.00192","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
Many studies have shown that non-default decision thresholds are required to maximize classification performance on highly imbalanced data sets. Thresholding strategies include using a threshold equal to the prior probability of the positive class or identifying an optimal threshold on training data. It is not clear, however, how these thresholding strategies will generalize to imbalanced data sets that contain class label noise. When class noise is present, the positive class prior is influenced by the class label noise, and a threshold that is optimized on noisy training data may not generalize to test data. We employ four thresholding strategies: two thresholds that are optimized on training data and two thresholds that depend on the positive class prior. Thresholding strategies are evaluated across a range of noise levels and noise distributions using the Random Forest, Multilayer Perceptron, and XGBoost learners. While all four thresholding strategies significantly outperform the default threshold with respect to the Geometric Mean (G-Mean), three of the four thresholds yield unstable true positive rates (TPR) and true negative rates (TNR) in the presence of class noise. Results show that setting the threshold equal to the prior probability of the noisy positive class consistently performs best according to G-Mean, TPR, and TNR. To the best of our knowledge, this is the first evaluation of thresholding strategies for imbalanced and noisy data, and our results contradict related works that have suggested optimizing thresholds on training data as the best approach.
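The sketch below illustrates the two families of thresholding strategies the abstract contrasts: setting the decision threshold equal to the (possibly noisy) positive-class prior, and optimizing the threshold on training data by maximizing the G-Mean (the geometric mean of TPR and TNR). It is a minimal illustration only: the synthetic data, the Random Forest settings, and the candidate-threshold grid are assumptions for demonstration and are not taken from the paper's experimental setup.

```python
# Two thresholding strategies for imbalanced data (illustrative sketch):
# (1) threshold = prior probability of the positive class in the training labels
# (2) threshold chosen to maximize G-Mean on the training data
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def g_mean(y_true, y_pred):
    """Geometric mean of the true positive rate and true negative rate."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return np.sqrt(tpr * tnr)


# Highly imbalanced synthetic data (about 1% positive class) as a stand-in
# for the data sets studied in the paper.
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
scores_tr = clf.predict_proba(X_tr)[:, 1]
scores_te = clf.predict_proba(X_te)[:, 1]

# Strategy 1: threshold equal to the positive-class prior observed in the
# (possibly noisy) training labels.
prior_threshold = y_tr.mean()

# Strategy 2: threshold that maximizes G-Mean on the training data,
# searched over a simple grid (the grid itself is an assumption here).
candidates = np.linspace(0.01, 0.99, 99)
opt_threshold = max(
    candidates, key=lambda t: g_mean(y_tr, (scores_tr >= t).astype(int))
)

for name, t in [("prior", prior_threshold), ("optimized", opt_threshold)]:
    preds = (scores_te >= t).astype(int)
    print(f"{name} threshold = {t:.3f}, test G-Mean = {g_mean(y_te, preds):.3f}")
```

With label noise, the two strategies can diverge: the prior-based threshold simply tracks the observed class ratio, while the G-Mean-optimized threshold is fit to noisy training labels and, as the abstract reports, may not generalize to test data.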