Feature importance and model performance for prediabetes prediction: A comparative study

IF 3.6 | Region 3 (multidisciplinary journals) | Q1 MULTIDISCIPLINARY SCIENCES

Journal of King Saud University - Science | Pub Date: 2024-12-01 | DOI: 10.1016/j.jksus.2024.103583

Saeed Awad M Alqahtani, Hussah M Alobaid, Jamilah Alshammari, Safa A Alqarzae, Sheka Yagub Aloyouni, Ahood A. Al-Eidan, Salwa Alhamad, Abeer Almiman, Fadwa M Alkhulaifi, Suliman Alomar

Volume 36 (11), Article 103583 | https://www.sciencedirect.com/science/article/pii/S1018364724004956


Abstract

Objectives

Prediabetes is a significant health condition that elevates the risk of developing type 2 diabetes and other associated complications. This study aims to (1) explore the potential of machine learning models to improve the prediction of prediabetes, (2) compare the performance of various machine learning models with traditional regression methods, and (3) identify the most influential demographic, socioeconomic, and health-related factors associated with prediabetes.

Methods

This study utilized data from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) and employed comprehensive data preprocessing techniques. Logistic regression analysis was conducted to assess correlations between features and prediabetes risk. Feature importance was quantified using Adjusted Mutual Information values. Multiple machine learning models, including Random Forest, K Nearest Neighbors (KNN), Extreme Gradient Boosting (XGBoost), Neural Network, and Logistic Regression, were used for prediction. The best model was selected and validated through cross-validation to ensure robustness.
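The feature-importance step described above can be illustrated with a minimal sketch. It assumes scikit-learn's `adjusted_mutual_info_score` as the Adjusted Mutual Information estimator; the abstract does not say which implementation the authors used. The feature names (`high_chol`, `bmi_cat`, `noise`) and all data are synthetic and hypothetical, not drawn from BRFSS.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
n = 5000

# Hypothetical categorical features (synthetic; NOT BRFSS data).
high_chol = rng.integers(0, 2, n)   # 0 = no, 1 = yes
bmi_cat = rng.integers(0, 4, n)     # four ordered BMI categories
noise = rng.integers(0, 3, n)       # feature unrelated to the outcome

# Synthetic outcome: probability rises with high cholesterol and BMI category.
p = 0.05 + 0.25 * high_chol + 0.10 * bmi_cat
prediabetes = (rng.random(n) < p).astype(int)

features = {"high_chol": high_chol, "bmi_cat": bmi_cat, "noise": noise}


def rank_by_ami(features, target):
    """Return (feature name, AMI with target) pairs, highest AMI first."""
    scores = {
        name: adjusted_mutual_info_score(target, column)
        for name, column in features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_ami(features, prediabetes)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Because AMI corrects for chance agreement, a feature unrelated to the outcome should score close to zero, which makes scores comparable across features with different numbers of categories.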

Results

Significant associations were observed between prediabetes and key predictors such as cholesterol levels, BMI categories, hypertension status, age groups, and income categories. Among the models tested, Random Forest demonstrated the highest accuracy and robustness, outperforming traditional regression models.
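A comparison of this kind can be sketched with cross-validated accuracy, as below. This is an illustration under stated assumptions, not the authors' pipeline: the data come from scikit-learn's `make_classification` rather than BRFSS, the hyperparameters are arbitrary, and XGBoost and the neural network are left out to keep the example dependent on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption; NOT BRFSS data).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

# Scale inputs for the distance- and gradient-based models; trees do not need it.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# The same stratified folds are reused so every model sees identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the fold-to-fold standard deviation alongside the mean gives a rough view of robustness; which model ranks first on real survey data depends on the data and tuning, so the ordering on this synthetic set carries no meaning.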

Conclusions

This study highlights the potential of machine learning to enhance prediabetes prediction and underscores the importance of identifying high-risk individuals for early intervention. The findings contribute to population health strategies by integrating advanced analytical methods with public health data.




Source journal

Journal of King Saud University - Science (Multidisciplinary)

CiteScore: 7.20

Self-citation rate: 2.60%

Articles published: 642

Review time: 49 days

About the journal: Journal of King Saud University – Science is an official refereed publication of King Saud University, with publishing services provided by Elsevier. It publishes peer-reviewed research articles in the fields of physics, astronomy, mathematics, statistics, chemistry, biochemistry, earth sciences, life and environmental sciences on the basis of scientific originality and interdisciplinary interest. It is devoted primarily to research papers, but short communications, reviews and book reviews are also included. The editorial board and associate editors, composed of prominent scientists from around the world, are representative of the disciplines covered by the journal.