An integrated oversampling and noise reduction method for robust predictive analytics
Jeong-Wook Lee, Young Eun Jeon, Jung-In Seo. Decision Analytics Journal, vol. 16, Article 100612. Published 2025-07-23. DOI: 10.1016/j.dajour.2025.100612. https://www.sciencedirect.com/science/article/pii/S2772662225000682
Imbalanced data arise when rare but critical events occur far less often than ordinary ones, and they are especially common in disease diagnosis, fraud detection, and risk management. The main problem is that predictive models trained on such data tend to become biased toward the majority class: a model may achieve high overall accuracy yet rarely identify minority-class data points correctly. When the minority class is the class of interest, these misclassifications undermine the reliability and validity of the predictions. To address this issue, this study develops a resampling strategy that integrates random over-sampling examples (ROSE) with Tomek links. The strategy increases data diversity by generating synthetic data points from a probability distribution while removing noisy and overlapping data points, which yields a higher-quality dataset. For illustration, a stroke dataset with a severe imbalance ratio of 98:2 is used. To evaluate the strategy and demonstrate its applicability, we apply a wide range of machine learning and deep learning models, including support vector machine, elastic net, random forest, extreme gradient boosting, and deep and convolutional neural networks. The results suggest that the strategy can be applied effectively to other medical datasets with severe class imbalance and can enable more reliable and efficient predictive modeling in critical healthcare applications.
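The two-stage idea in the abstract (oversample the minority class from a probability distribution, then remove overlapping points via Tomek links) can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the authors' implementation. The function names, the fixed Gaussian `bandwidth`, the binary-label assumption, and the choice to drop only the majority-class member of each Tomek link are all assumptions made here, since the abstract does not specify them.

```python
import math
import random


def rose_like_oversample(minority, n_new, bandwidth=0.1, rng=None):
    """Smoothed bootstrap: pick a minority point at random and add Gaussian
    noise to each feature. The fixed bandwidth is an assumption of this sketch."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        seed_point = rng.choice(minority)
        synthetic.append(tuple(v + rng.gauss(0.0, bandwidth) for v in seed_point))
    return synthetic


def nearest_index(points, i):
    """Index of the point closest to points[i] (Euclidean), excluding i itself."""
    best, best_d = None, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)
        if d < best_d:
            best, best_d = j, d
    return best


def remove_tomek_links(points, labels, majority_label):
    """A Tomek link is a pair of mutual nearest neighbours with different labels.
    Assuming binary labels, drop the majority-class member of each link."""
    nn = [nearest_index(points, i) for i in range(len(points))]
    drop = set()
    for i, j in enumerate(nn):
        if j is not None and nn[j] == i and labels[i] != labels[j]:
            drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


def resample(points, labels, minority_label, majority_label, bandwidth=0.1, seed=0):
    """Oversample the minority class up to the majority count, then clean
    the class boundary by removing Tomek links."""
    rng = random.Random(seed)
    minority = [p for p, y in zip(points, labels) if y == minority_label]
    n_major = sum(1 for y in labels if y == majority_label)
    synthetic = rose_like_oversample(minority, n_major - len(minority), bandwidth, rng)
    new_points = list(points) + synthetic
    new_labels = list(labels) + [minority_label] * len(synthetic)
    return remove_tomek_links(new_points, new_labels, majority_label)


if __name__ == "__main__":
    # Toy data with a 49:1 split, mirroring the 98:2 ratio in the abstract.
    majority = [(x * 0.5, y * 0.5) for x in range(7) for y in range(7)]
    minority = [(1.2, 1.3)]
    X = majority + minority
    y = [0] * len(majority) + [1] * len(minority)
    X_res, y_res = resample(X, y, minority_label=1, majority_label=0)
    print("majority:", y_res.count(0), "minority:", y_res.count(1))
```

In practice, a maintained library implementation is usually preferable to this quadratic-time sketch, and the resampling should be applied only to the training split so that evaluation data keep the original class ratio.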