An integrated oversampling and noise reduction method for robust predictive analytics

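Decision Analytics Journal, Volume 16, Article 100612. Pub Date: 2025-07-23. DOI: 10.1016/j.dajour.2025.100612. Source: https://www.sciencedirect.com/science/article/pii/S2772662225000682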

Jeong-Wook Lee, Young Eun Jeon, Jung-In Seo


Citations: 0

Abstract



Imbalanced data is often encountered in scenarios where rare but critical events occur much less frequently than others, and it is particularly prominent in fields such as disease diagnosis, fraud detection, and risk management. The main problem with imbalanced data is that predictive models built with machine learning algorithms are likely to become biased toward the majority class. For example, a model may achieve high overall accuracy yet perform poorly at correctly identifying minority class data points. If the minority class is the class of interest, such a model may produce serious misclassifications, which impairs the reliability and validity of its predictions. In response to this issue, this study develops a resampling strategy that integrates random over-sampling examples with Tomek links. The developed strategy increases data diversity by generating synthetic data points from a probability distribution while eliminating noisy and overlapping data points, resulting in a higher-quality dataset. For illustrative purposes, a stroke dataset with a severe imbalance ratio of 98:2 is employed. To evaluate the performance of the resampling strategy and demonstrate its applicability, we apply a wide range of machine learning and deep learning models, including support vector machine, elastic net, random forest, extreme gradient boosting, and deep and convolutional neural networks. The outcomes of this study suggest that the developed resampling strategy can be effectively applied to other medical datasets with severe class imbalances and can enable more reliable and efficient predictive modeling in critical healthcare applications.
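The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: oversample the minority class with a smoothed bootstrap (the idea behind random over-sampling examples, ROSE), then remove Tomek links, that is, pairs of mutual nearest neighbours with different class labels. All function names, the fixed Gaussian `bandwidth`, oversampling only the minority class up to parity, and dropping both points of each Tomek link are assumptions made for illustration, not the authors' method.

```python
import math
import random


def smoothed_bootstrap(minority, n_new, bandwidth, rng):
    """Draw n_new synthetic points: pick a minority point at random
    and perturb each coordinate with Gaussian noise (assumed fixed bandwidth)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        synthetic.append(tuple(x + rng.gauss(0.0, bandwidth) for x in base))
    return synthetic


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    best, best_dist = None, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)
        if d < best_dist:
            best, best_dist = j, d
    return best


def tomek_links(points, labels):
    """Pairs (i, j) with i < j that are mutual nearest neighbours
    and carry different class labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def resample(points, labels, minority_label, bandwidth=0.1, seed=0):
    """Oversample the minority class to parity, then drop Tomek-link points.
    Dropping both points of each link is an assumption; some variants
    drop only the majority-class point."""
    rng = random.Random(seed)
    minority = [p for p, y in zip(points, labels) if y == minority_label]
    n_majority = len(points) - len(minority)
    n_new = max(0, n_majority - len(minority))
    new_points = smoothed_bootstrap(minority, n_new, bandwidth, rng)

    X = list(points) + new_points
    y = list(labels) + [minority_label] * len(new_points)

    drop = {idx for pair in tomek_links(X, y) for idx in pair}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


if __name__ == "__main__":
    # Toy data (hypothetical): 8 majority points, 2 minority points.
    majority = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1)]
    minority = [(10, 10), (10, 11)]
    X, y = resample(majority + minority, [0] * 8 + [1] * 2, minority_label=1)
    print("class counts after resampling:", {c: y.count(c) for c in set(y)})
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a toy illustration but would need a spatial index or a library implementation for a dataset of realistic size.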


Source journal

Decision Analytics Journal

CiteScore

3.90

Self-citation rate

0.00%

Number of publications