Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study

Impact factor: 0.2 · Q3 · Medicine, General & Internal
Ewha Medical Journal · Publication date: 2025-04-01 · Epub date: 2025-04-15 · DOI: 10.12771/emj.2025.00353
Younseo Jang
{"title":"使用SMOTE、RUS和随机森林方法解决糖尿病数据不平衡的基于特征的集成建模:一项预测研究","authors":"Younseo Jang","doi":"10.12771/emj.2025.00353","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study developed and evaluated a feature-based ensemble model integrating the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) methods with a random forest approach to address class imbalance in machine learning for early diabetes detection, aiming to improve predictive performance.</p><p><strong>Methods: </strong>Using the Scikit-learn diabetes dataset (442 samples, 10 features), we binarized the target variable (diabetes progression) at the 75th percentile and split it 80:20 using stratified sampling. The training set was balanced to a 1:2 minority-to-majority ratio via SMOTE (0.6) and RUS (0.66). A feature-based ensemble model was constructed by training random forest classifiers on 10 two-feature subsets, selected based on feature importance, and combining their outputs using soft voting. Performance was compared against 13 baseline models, using accuracy and area under the curve (AUC) as metrics on the imbalanced test set.</p><p><strong>Results: </strong>The feature-based ensemble model and balanced random forest both achieved the highest accuracy (0.8764), followed by the fully connected neural network (0.8700). The ensemble model had an excellent AUC (0.9227), while k-nearest neighbors had the lowest accuracy (0.8427). Visualizations confirmed its superior discriminative ability, especially for the minority (high-risk) class, which is a critical factor in medical contexts.</p><p><strong>Conclusion: </strong>Integrating SMOTE, RUS, and feature-based ensemble learning improved classification performance in imbalanced diabetes datasets by delivering robust accuracy and high recall for the minority class. This approach outperforms traditional resampling techniques and deep learning models, offering a scalable and interpretable solution for early diabetes prediction and potentially other medical applications.</p>","PeriodicalId":41392,"journal":{"name":"Ewha Medical Journal","volume":"48 2","pages":"e32"},"PeriodicalIF":0.2000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12277495/pdf/","citationCount":"0","resultStr":"{\"title\":\"Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study.\",\"authors\":\"Younseo Jang\",\"doi\":\"10.12771/emj.2025.00353\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This study developed and evaluated a feature-based ensemble model integrating the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) methods with a random forest approach to address class imbalance in machine learning for early diabetes detection, aiming to improve predictive performance.</p><p><strong>Methods: </strong>Using the Scikit-learn diabetes dataset (442 samples, 10 features), we binarized the target variable (diabetes progression) at the 75th percentile and split it 80:20 using stratified sampling. The training set was balanced to a 1:2 minority-to-majority ratio via SMOTE (0.6) and RUS (0.66). A feature-based ensemble model was constructed by training random forest classifiers on 10 two-feature subsets, selected based on feature importance, and combining their outputs using soft voting. 
Performance was compared against 13 baseline models, using accuracy and area under the curve (AUC) as metrics on the imbalanced test set.</p><p><strong>Results: </strong>The feature-based ensemble model and balanced random forest both achieved the highest accuracy (0.8764), followed by the fully connected neural network (0.8700). The ensemble model had an excellent AUC (0.9227), while k-nearest neighbors had the lowest accuracy (0.8427). Visualizations confirmed its superior discriminative ability, especially for the minority (high-risk) class, which is a critical factor in medical contexts.</p><p><strong>Conclusion: </strong>Integrating SMOTE, RUS, and feature-based ensemble learning improved classification performance in imbalanced diabetes datasets by delivering robust accuracy and high recall for the minority class. This approach outperforms traditional resampling techniques and deep learning models, offering a scalable and interpretable solution for early diabetes prediction and potentially other medical applications.</p>\",\"PeriodicalId\":41392,\"journal\":{\"name\":\"Ewha Medical Journal\",\"volume\":\"48 2\",\"pages\":\"e32\"},\"PeriodicalIF\":0.2000,\"publicationDate\":\"2025-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12277495/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ewha Medical Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.12771/emj.2025.00353\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/4/15 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"MEDICINE, GENERAL & INTERNAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ewha Medical Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12771/emj.2025.00353","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/15 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: This study developed and evaluated a feature-based ensemble model integrating the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) methods with a random forest approach to address class imbalance in machine learning for early diabetes detection, aiming to improve predictive performance.

Methods: Using the Scikit-learn diabetes dataset (442 samples, 10 features), we binarized the target variable (diabetes progression) at the 75th percentile and split it 80:20 using stratified sampling. The training set was balanced to a 1:2 minority-to-majority ratio via SMOTE (0.6) and RUS (0.66). A feature-based ensemble model was constructed by training random forest classifiers on 10 two-feature subsets, selected based on feature importance, and combining their outputs using soft voting. Performance was compared against 13 baseline models, using accuracy and area under the curve (AUC) as metrics on the imbalanced test set.
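
The abstract does not give implementation details beyond the parameters above, but the pipeline can be sketched with scikit-learn and imbalanced-learn. In this minimal sketch, the 75th-percentile binarization, stratified 80:20 split, and SMOTE (0.6) and RUS (0.66) ratios follow the Methods; the resampling order, the way the 10 two-feature subsets are paired, the forest hyperparameters, and the random seeds are illustrative assumptions only, not the study's actual settings.

```python
# Illustrative sketch of the pipeline outlined in the Methods (not the authors' code).
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Scikit-learn diabetes dataset: 442 samples, 10 features; the continuous
# progression score is binarized at its 75th percentile (1 = high risk).
X, y_raw = load_diabetes(return_X_y=True)
y = (y_raw >= np.percentile(y_raw, 75)).astype(int)

# Stratified 80:20 split; the test set keeps the original class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Rebalance the training set only: SMOTE up to a 0.6 minority/majority ratio,
# then random undersampling of the majority class at a 0.66 ratio.
X_bal, y_bal = SMOTE(sampling_strategy=0.6, random_state=0).fit_resample(X_train, y_train)
X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.66, random_state=0).fit_resample(X_bal, y_bal)

# Rank features by importance with a forest on the balanced data, then form
# 10 two-feature subsets (the pairing scheme below is an assumption).
ranker = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
order = np.argsort(ranker.feature_importances_)[::-1]
pairs = [(order[i], order[(i + 1) % len(order)]) for i in range(10)]

# One random forest per feature pair; soft voting averages their predicted
# probabilities for the high-risk class on the untouched, imbalanced test set.
members = [RandomForestClassifier(n_estimators=200, random_state=0)
           .fit(X_bal[:, list(p)], y_bal) for p in pairs]
proba = np.mean([m.predict_proba(X_test[:, list(p)])[:, 1]
                 for m, p in zip(members, pairs)], axis=0)
y_pred = (proba >= 0.5).astype(int)
```

Soft voting here simply averages each member's predicted probability for the positive class, so a member trained on a weakly informative feature pair dilutes, rather than overrides, the stronger members.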

Results: The feature-based ensemble model and balanced random forest both achieved the highest accuracy (0.8764), followed by the fully connected neural network (0.8700). The ensemble model had an excellent AUC (0.9227), while k-nearest neighbors had the lowest accuracy (0.8427). Visualizations confirmed its superior discriminative ability, especially for the minority (high-risk) class, which is a critical factor in medical contexts.
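
Continuing the sketch above (and inheriting its assumptions), the reported metrics can be computed directly on the held-out, imbalanced test set with scikit-learn; the minority-class recall emphasized in the Conclusion can be obtained the same way. The printed values will not reproduce the paper's figures exactly, since the hyperparameters and seeds above are assumed.

```python
# Continues the sketch above (uses y_test, y_pred, and proba defined there).
from sklearn.metrics import accuracy_score, roc_auc_score, recall_score

print("accuracy:        ", accuracy_score(y_test, y_pred))             # headline metric
print("ROC AUC:         ", roc_auc_score(y_test, proba))               # threshold-free ranking quality
print("minority recall: ", recall_score(y_test, y_pred, pos_label=1))  # high-risk class sensitivity
```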

Conclusion: Integrating SMOTE, RUS, and feature-based ensemble learning improved classification performance in imbalanced diabetes datasets by delivering robust accuracy and high recall for the minority class. This approach outperforms traditional resampling techniques and deep learning models, offering a scalable and interpretable solution for early diabetes prediction and potentially other medical applications.
