Enhancing diabetes risk prediction: A comparative evaluation of bagging, boosting, and ensemble classifiers with SMOTE oversampling

Q1 Medicine

Informatics in Medicine Unlocked Pub Date : 2025-01-01 DOI:10.1016/j.imu.2025.101661

Rabia Asif, Darshana Upadhyay, Marzia Zaman, Srini Sampalli

Volume 57, Article 101661. Full text: https://www.sciencedirect.com/science/article/pii/S2352914825000498


Abstract

Diabetes is a major global health concern, with millions of individuals at risk of developing this chronic condition. Early prediction and intervention are essential for effective diabetes management. This study explores advanced machine learning techniques, specifically bagging, boosting, and ensemble methods, to improve diabetes risk prediction. Three diverse datasets, namely the Centers for Disease Control and Prevention (CDC) Diabetes Health Indicators dataset, the Early Stage Diabetes Risk Prediction System (ESDRP) dataset, and the PIMA Indian Diabetes dataset, are used to evaluate the adaptability and robustness of the proposed models. Our approach addresses critical gaps in existing research, including the handling of highly imbalanced datasets through the Synthetic Minority Over-sampling Technique (SMOTE), the necessity of feature selection, and the underutilization of the CDC dataset in diabetes studies. We find that applying SMOTE to the CDC dataset significantly enhances model performance, with the CatBoost algorithm achieving an accuracy of 91%. For the ESDRP dataset, ensemble methods demonstrate even stronger results, achieving 98% accuracy using the top five features. This study not only contributes to the development of more accurate predictive models for diabetes risk but also provides insights into enhancing the robustness of machine learning methods in healthcare.
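The abstract does not say which SMOTE implementation or settings the authors used, so the following is only a minimal sketch of the general technique: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its k nearest minority-class neighbours. The function name `smote`, the parameters `n_new`, `k`, and `seed`, and the toy data are all illustrative assumptions and are not taken from the paper or its datasets.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Illustrative sketch only; not the authors' implementation.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical values, not drawn from any of the study's datasets).
majority = [[float(i), float(i % 3)] for i in range(10)]
minority = [[1.0, 5.0], [2.0, 6.0], [1.5, 5.5], [2.5, 6.5]]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred, and oversampling is usually applied to the training split only, so that synthetic points do not leak into the evaluation data.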


Source journal

Informatics in Medicine Unlocked (Medicine: Health Informatics)

CiteScore: 9.50

Self-citation rate: 0.00%

Articles published: 282

Review time: 39 days

Journal description: Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.