Enhancing diabetes risk prediction: A comparative evaluation of bagging, boosting, and ensemble classifiers with SMOTE oversampling

Q1 Medicine
Rabia Asif, Darshana Upadhyay, Marzia Zaman, Srini Sampalli
Informatics in Medicine Unlocked, vol. 57, Article 101661. Published 2025-01-01. DOI: 10.1016/j.imu.2025.101661
https://www.sciencedirect.com/science/article/pii/S2352914825000498
Citations: 0

Abstract

Diabetes is a major global health concern, with millions of individuals at risk of developing this chronic condition. Early prediction and intervention are essential for effective diabetes management. This study explores advanced machine learning techniques, specifically bagging, boosting, and ensemble methods, to improve diabetes risk prediction. Three diverse datasets, namely the Centers for Disease Control and Prevention (CDC) Diabetes Health Indicators dataset, the Early Stage Diabetes Risk Prediction System (ESDRP) dataset, and the PIMA Indian Diabetes dataset, are used to evaluate the adaptability and robustness of the proposed models. Our approach addresses critical gaps in existing research, including the handling of highly imbalanced datasets through the Synthetic Minority Over-sampling Technique (SMOTE), the necessity of feature selection, and the underutilization of the CDC dataset in diabetes studies. We find that applying SMOTE to the CDC dataset significantly enhances model performance, with the CatBoost algorithm achieving an accuracy of 91%. For the ESDRP dataset, ensemble methods demonstrate even stronger results, achieving 98% accuracy using the top five features. This study not only contributes to the development of more accurate predictive models for diabetes risk but also provides insights into enhancing the robustness of machine learning methods in healthcare.
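The abstract names SMOTE as the remedy for class imbalance without showing the mechanism. The following is a minimal sketch of the core idea only, not the authors' pipeline: each synthetic minority point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. The function name `smote`, its parameters, and the toy data are assumptions made for illustration, and base points are drawn at random rather than visited in order as in the original algorithm.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (illustrative sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made-up values): 3 minority rows against 8 majority rows.
minority_rows = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
majority_count = 8
new_rows = smote(minority_rows, majority_count - len(minority_rows), k=2)
balanced_minority = minority_rows + new_rows
```

Because every synthetic row is a convex combination of two real minority rows, it stays inside the region the minority class already occupies. Oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.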
Source journal
Informatics in Medicine Unlocked (Medicine, Health Informatics)
CiteScore: 9.50
Self-citation rate: 0.00%
Articles published: 282
Review time: 39 days
About the journal: Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.