Enhancing diabetes risk prediction: A comparative evaluation of bagging, boosting, and ensemble classifiers with SMOTE oversampling
Authors: Rabia Asif, Darshana Upadhyay, Marzia Zaman, Srini Sampalli
Journal: Informatics in Medicine Unlocked, volume 57, Article 101661 (JCR Q1, Medicine)
Publication date: 2025-01-01
DOI: 10.1016/j.imu.2025.101661
URL: https://www.sciencedirect.com/science/article/pii/S2352914825000498
Citation count: 0
Abstract
Diabetes is a major global health concern, with millions of individuals at risk of developing this chronic condition. Early prediction and intervention are essential for effective diabetes management. This study explores advanced machine learning techniques, specifically bagging, boosting, and ensemble methods, to improve diabetes risk prediction. Three diverse datasets, namely the Centers for Disease Control and Prevention (CDC) Diabetes Health Indicators dataset, the Early Stage Diabetes Risk Prediction System (ESDRP) dataset, and the PIMA Indian Diabetes dataset, are used to evaluate the adaptability and robustness of the proposed models. Our approach addresses critical gaps in existing research, including the handling of highly imbalanced datasets through the Synthetic Minority Over-sampling Technique (SMOTE), the necessity of feature selection, and the underutilization of the CDC dataset in diabetes studies. We find that applying SMOTE to the CDC dataset significantly enhances model performance, with the CatBoost algorithm achieving an accuracy of 91%. For the ESDRP dataset, ensemble methods demonstrate even stronger results, achieving 98% accuracy using the top five features. This study not only contributes to the development of more accurate predictive models for diabetes risk but also provides insights into enhancing the robustness of machine learning methods in healthcare.
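The abstract names SMOTE as the remedy for class imbalance without describing how it works. The sketch below illustrates the core idea only: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority-class neighbours. It is not the authors' implementation; the function name `smote`, its parameters, and the toy data are assumptions made for illustration. In practice a maintained implementation such as `imblearn.over_sampling.SMOTE` would normally be used, applied to the training split only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data, made up for illustration: 8 majority vs 3 minority samples.
majority = [[float(x), float(x % 3)] for x in range(8)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

# Generate just enough synthetic points to match the majority class size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the original minority data already occupies, unlike naive duplication, which only repeats existing points.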
About the journal:
Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.