Tree boosting methods for balanced and imbalanced classification and their robustness over time in risk assessment

Gissel Velarde, Michael Weichert, Anuj Deshmunkh, Sanjay Deshmane, Anindya Sudhir, Khushboo Sharma, Vaibhav Joshi
{"title":"Tree boosting methods for balanced and imbalanced classification and their robustness over time in risk assessment","authors":"Gissel Velarde ,&nbsp;Michael Weichert,&nbsp;Anuj Deshmunkh,&nbsp;Sanjay Deshmane,&nbsp;Anindya Sudhir,&nbsp;Khushboo Sharma,&nbsp;Vaibhav Joshi","doi":"10.1016/j.iswa.2024.200354","DOIUrl":null,"url":null,"abstract":"<div><p>Most real-world classification problems deal with imbalanced datasets, posing a challenge for Artificial Intelligence (AI), i.e., machine learning algorithms, because the minority class, which is of extreme interest, often proves difficult to be detected. This paper empirically evaluates tree boosting methods' performance given different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. For tabular data, tree-based methods such as XGBoost, stand out in several benchmarks due to detection performance and speed. Therefore, XGBoost and Imbalance-XGBoost are evaluated. After introducing the motivation to address risk assessment with machine learning, the paper reviews evaluation metrics for detection systems or binary classifiers. It proposes a method for data preparation followed by tree boosting methods including hyper-parameter optimization. The method is evaluated on private datasets of 1 thousand (K), 10K and 100K samples on distributions with 50, 45, 25, and 5 percent positive samples. As expected, the developed method increases its recognition performance as more data is given for training and the F1 score decreases as the data distribution becomes more imbalanced, but it is still significantly superior to the baseline of precision-recall determined by the ratio of positives divided by positives and negatives. Sampling to balance the training set does not provide consistent improvement and deteriorates detection. In contrast, classifier hyper-parameter optimization improves recognition, but should be applied carefully depending on data volume and distribution. Finally, the developed method is robust to data variation over time up to some point. Retraining can be used when performance starts deteriorating.</p></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"22 ","pages":"Article 200354"},"PeriodicalIF":0.0000,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667305324000309/pdfft?md5=be6e208c32a749998c8ea1ee56dcab8e&pid=1-s2.0-S2667305324000309-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Systems with Applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2667305324000309","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Most real-world classification problems deal with imbalanced datasets, posing a challenge for Artificial Intelligence (AI), i.e., machine learning algorithms, because the minority class, which is of extreme interest, often proves difficult to detect. This paper empirically evaluates the performance of tree boosting methods across different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. For tabular data, tree-based methods such as XGBoost stand out in several benchmarks due to their detection performance and speed. Therefore, XGBoost and Imbalance-XGBoost are evaluated. After introducing the motivation to address risk assessment with machine learning, the paper reviews evaluation metrics for detection systems or binary classifiers. It proposes a method consisting of data preparation followed by tree boosting with hyper-parameter optimization. The method is evaluated on private datasets of 1 thousand (1K), 10K, and 100K samples, with class distributions of 50, 45, 25, and 5 percent positive samples. As expected, the developed method improves its recognition performance as more data is given for training, and the F1 score decreases as the data distribution becomes more imbalanced, yet it remains significantly superior to the precision-recall baseline determined by the ratio of positives to the total of positives and negatives. Sampling to balance the training set does not provide consistent improvement and can even deteriorate detection. In contrast, classifier hyper-parameter optimization improves recognition, but should be applied carefully depending on data volume and distribution. Finally, the developed method is robust to data variation over time up to a point; retraining can be applied when performance starts to deteriorate.
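
As a concrete illustration of the setup the abstract describes, the sketch below trains XGBoost on a synthetic imbalanced dataset, compares the F1 score against the precision baseline P / (P + N) (the precision of a classifier that labels every sample positive), and runs a small randomized hyper-parameter search optimizing F1. This is a minimal sketch assuming scikit-learn and the xgboost package; the synthetic data, the parameter grid, and the scale_pos_weight reweighting are illustrative assumptions, not the paper's private datasets or its exact method.

```python
# Illustrative sketch: XGBoost on imbalanced binary data, evaluated against
# the precision baseline P / (P + N), plus a small hyper-parameter search.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score
from xgboost import XGBClassifier

# Synthetic stand-in for one of the paper's settings: 10K samples, 5% positives.
X, y = make_classification(n_samples=10_000, n_features=20, n_informative=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Baseline precision: what a classifier labeling everything positive achieves,
# i.e., positives / (positives + negatives).
baseline_precision = y.mean()
print(f"Baseline precision: {baseline_precision:.3f}")

# Plain XGBoost; scale_pos_weight is one common way to reweight the minority
# class (an assumption here, not necessarily the paper's configuration).
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
                    eval_metric="logloss", random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(f"F1: {f1_score(y_test, pred):.3f}  "
      f"precision: {precision_score(y_test, pred):.3f}  "
      f"recall: {recall_score(y_test, pred):.3f}")

# Small randomized search optimizing F1, as one way to realize the
# hyper-parameter optimization step the abstract refers to.
search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=0),
    param_distributions={"n_estimators": [100, 200, 400],
                         "max_depth": [3, 4, 6, 8],
                         "learning_rate": [0.05, 0.1, 0.3]},
    n_iter=10, scoring="f1", cv=3, random_state=0)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print(f"Tuned F1: {f1_score(y_test, search.predict(X_test)):.3f}")
```

On such a 5-percent-positive split the baseline precision is about 0.05, so any F1 score well above that level indicates the classifier is actually learning the minority class rather than exploiting the class ratio.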

