Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study

Impact factor: 0.2 · Q3 · Medicine, General & Internal
Ewha Medical Journal · Publication date: 2025-04-01 · Epub date: 2025-04-15 · DOI: 10.12771/emj.2025.00353
Younseo Jang
{"title":"使用SMOTE、RUS和随机森林方法解决糖尿病数据不平衡的基于特征的集成建模:一项预测研究","authors":"Younseo Jang","doi":"10.12771/emj.2025.00353","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study developed and evaluated a feature-based ensemble model integrating the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) methods with a random forest approach to address class imbalance in machine learning for early diabetes detection, aiming to improve predictive performance.</p><p><strong>Methods: </strong>Using the Scikit-learn diabetes dataset (442 samples, 10 features), we binarized the target variable (diabetes progression) at the 75th percentile and split it 80:20 using stratified sampling. The training set was balanced to a 1:2 minority-to-majority ratio via SMOTE (0.6) and RUS (0.66). A feature-based ensemble model was constructed by training random forest classifiers on 10 two-feature subsets, selected based on feature importance, and combining their outputs using soft voting. Performance was compared against 13 baseline models, using accuracy and area under the curve (AUC) as metrics on the imbalanced test set.</p><p><strong>Results: </strong>The feature-based ensemble model and balanced random forest both achieved the highest accuracy (0.8764), followed by the fully connected neural network (0.8700). The ensemble model had an excellent AUC (0.9227), while k-nearest neighbors had the lowest accuracy (0.8427). Visualizations confirmed its superior discriminative ability, especially for the minority (high-risk) class, which is a critical factor in medical contexts.</p><p><strong>Conclusion: </strong>Integrating SMOTE, RUS, and feature-based ensemble learning improved classification performance in imbalanced diabetes datasets by delivering robust accuracy and high recall for the minority class. This approach outperforms traditional resampling techniques and deep learning models, offering a scalable and interpretable solution for early diabetes prediction and potentially other medical applications.</p>","PeriodicalId":41392,"journal":{"name":"Ewha Medical Journal","volume":"48 2","pages":"e32"},"PeriodicalIF":0.2000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12277495/pdf/","citationCount":"0","resultStr":"{\"title\":\"Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study.\",\"authors\":\"Younseo Jang\",\"doi\":\"10.12771/emj.2025.00353\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This study developed and evaluated a feature-based ensemble model integrating the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) methods with a random forest approach to address class imbalance in machine learning for early diabetes detection, aiming to improve predictive performance.</p><p><strong>Methods: </strong>Using the Scikit-learn diabetes dataset (442 samples, 10 features), we binarized the target variable (diabetes progression) at the 75th percentile and split it 80:20 using stratified sampling. The training set was balanced to a 1:2 minority-to-majority ratio via SMOTE (0.6) and RUS (0.66). A feature-based ensemble model was constructed by training random forest classifiers on 10 two-feature subsets, selected based on feature importance, and combining their outputs using soft voting. 
Performance was compared against 13 baseline models, using accuracy and area under the curve (AUC) as metrics on the imbalanced test set.</p><p><strong>Results: </strong>The feature-based ensemble model and balanced random forest both achieved the highest accuracy (0.8764), followed by the fully connected neural network (0.8700). The ensemble model had an excellent AUC (0.9227), while k-nearest neighbors had the lowest accuracy (0.8427). Visualizations confirmed its superior discriminative ability, especially for the minority (high-risk) class, which is a critical factor in medical contexts.</p><p><strong>Conclusion: </strong>Integrating SMOTE, RUS, and feature-based ensemble learning improved classification performance in imbalanced diabetes datasets by delivering robust accuracy and high recall for the minority class. This approach outperforms traditional resampling techniques and deep learning models, offering a scalable and interpretable solution for early diabetes prediction and potentially other medical applications.</p>\",\"PeriodicalId\":41392,\"journal\":{\"name\":\"Ewha Medical Journal\",\"volume\":\"48 2\",\"pages\":\"e32\"},\"PeriodicalIF\":0.2000,\"publicationDate\":\"2025-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12277495/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ewha Medical Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.12771/emj.2025.00353\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/4/15 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"MEDICINE, GENERAL & INTERNAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ewha Medical Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12771/emj.2025.00353","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/15 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: This study developed and evaluated a feature-based ensemble model integrating the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) methods with a random forest approach to address class imbalance in machine learning for early diabetes detection, aiming to improve predictive performance.

Methods: Using the Scikit-learn diabetes dataset (442 samples, 10 features), we binarized the target variable (diabetes progression) at the 75th percentile and split it 80:20 using stratified sampling. The training set was balanced to a 1:2 minority-to-majority ratio via SMOTE (0.6) and RUS (0.66). A feature-based ensemble model was constructed by training random forest classifiers on 10 two-feature subsets, selected based on feature importance, and combining their outputs using soft voting. Performance was compared against 13 baseline models, using accuracy and area under the curve (AUC) as metrics on the imbalanced test set.
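
The abstract does not give implementation details beyond the parameters above, but the pipeline can be sketched with scikit-learn and imbalanced-learn. In this minimal sketch, the 75th-percentile binarization, stratified 80:20 split, and SMOTE (0.6) and RUS (0.66) ratios follow the Methods; the resampling order, the way the 10 two-feature subsets are paired, the forest hyperparameters, and the random seeds are illustrative assumptions only, not the study's actual settings.

```python
# Illustrative sketch of the pipeline outlined in the Methods (not the authors' code).
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Scikit-learn diabetes dataset: 442 samples, 10 features; the continuous
# progression score is binarized at its 75th percentile (1 = high risk).
X, y_raw = load_diabetes(return_X_y=True)
y = (y_raw >= np.percentile(y_raw, 75)).astype(int)

# Stratified 80:20 split; the test set keeps the original class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Rebalance the training set only: SMOTE up to a 0.6 minority/majority ratio,
# then random undersampling of the majority class at a 0.66 ratio.
X_bal, y_bal = SMOTE(sampling_strategy=0.6, random_state=0).fit_resample(X_train, y_train)
X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.66, random_state=0).fit_resample(X_bal, y_bal)

# Rank features by importance with a forest on the balanced data, then form
# 10 two-feature subsets (the pairing scheme below is an assumption).
ranker = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
order = np.argsort(ranker.feature_importances_)[::-1]
pairs = [(order[i], order[(i + 1) % len(order)]) for i in range(10)]

# One random forest per feature pair; soft voting averages their predicted
# probabilities for the high-risk class on the untouched, imbalanced test set.
members = [RandomForestClassifier(n_estimators=200, random_state=0)
           .fit(X_bal[:, list(p)], y_bal) for p in pairs]
proba = np.mean([m.predict_proba(X_test[:, list(p)])[:, 1]
                 for m, p in zip(members, pairs)], axis=0)
y_pred = (proba >= 0.5).astype(int)
```

Soft voting here simply averages each member's predicted probability for the positive class, so a member trained on a weakly informative feature pair dilutes, rather than overrides, the stronger members.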

Results: The feature-based ensemble model and balanced random forest both achieved the highest accuracy (0.8764), followed by the fully connected neural network (0.8700). The ensemble model had an excellent AUC (0.9227), while k-nearest neighbors had the lowest accuracy (0.8427). Visualizations confirmed its superior discriminative ability, especially for the minority (high-risk) class, which is a critical factor in medical contexts.
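
Continuing the sketch above (and inheriting its assumptions), the reported metrics can be computed directly on the held-out, imbalanced test set with scikit-learn; the minority-class recall emphasized in the Conclusion can be obtained the same way. The printed values will not reproduce the paper's figures exactly, since the hyperparameters and seeds above are assumed.

```python
# Continues the sketch above (uses y_test, y_pred, and proba defined there).
from sklearn.metrics import accuracy_score, roc_auc_score, recall_score

print("accuracy:        ", accuracy_score(y_test, y_pred))             # headline metric
print("ROC AUC:         ", roc_auc_score(y_test, proba))               # threshold-free ranking quality
print("minority recall: ", recall_score(y_test, y_pred, pos_label=1))  # high-risk class sensitivity
```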

Conclusion: Integrating SMOTE, RUS, and feature-based ensemble learning improved classification performance in imbalanced diabetes datasets by delivering robust accuracy and high recall for the minority class. This approach outperforms traditional resampling techniques and deep learning models, offering a scalable and interpretable solution for early diabetes prediction and potentially other medical applications.
