A Biomarker-Driven and Interpretable Machine Learning Model for Diagnosing Diabetes Mellitus

IF 3.5 · Zone 2 (Agricultural and Forestry Sciences) · JCR Q2 (Food Science & Technology)
Zhihui Xiao, Mingfu Wang, Yueliang Zhao, Hui Wang
Food Science & Nutrition, 13(5). Journal Article. Published 2025-04-30. DOI: 10.1002/fsn3.70234
https://onlinelibrary.wiley.com/doi/10.1002/fsn3.70234

Abstract

Diabetes is one of the leading causes of death and disability worldwide, and earlier, more accurate diagnostic methods are crucial for its clinical prevention and treatment. Here, data on biochemical indicators and physiological characteristics of 4335 participants were collected from the National Health and Nutrition Examination Survey (NHANES) database for 2017 to 2020. After preprocessing, the dataset was randomly divided into a training set (70%) and a test set (30%), and the Boruta algorithm was used to screen feature indicators on the training set. Three machine learning algorithms, Random Forest (RF), Multi-Layer Perceptron (MLP), and Extreme Gradient Boosting (XGBoost), were then used to build predictive models with 10-fold cross-validation on the training set, followed by performance evaluation on the test set. The RF model performed best, with an area under the curve (AUC) of 0.958 (95% CI: 0.943–0.973), a recall of 0.897, a specificity of 0.916, an F1 score of 0.747, and an overall accuracy of 0.913. SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDP) were then applied to interpret the RF model and analyze the risk factors for diabetes. Glycohemoglobin, glucose, fasting glucose, age, cholesterol, osmolality, BMI, blood urea nitrogen, and insulin were found to exert the greatest influence on the prevalence of diabetes. Collectively, the RF model has considerable application prospects for the diagnosis of diabetes and can serve as a valuable supplementary tool for clinical diagnosis and risk assessment.
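
The evaluation metrics named in the abstract (recall, specificity, F1 score, and accuracy) all follow from the four cells of a binary confusion matrix. The sketch below is illustrative only: the class `ConfusionCounts`, the function `binary_metrics`, and the counts are hypothetical and are not taken from the paper. It shows how an F1 score can sit well below recall and specificity when the positive class is a minority, because F1 depends on precision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts; 'positive' means diabetes (hypothetical helper)."""
    tp: int  # diabetic, predicted diabetic
    fn: int  # diabetic, predicted non-diabetic
    fp: int  # non-diabetic, predicted diabetic
    tn: int  # non-diabetic, predicted non-diabetic


def binary_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the standard metrics from raw counts."""
    recall = c.tp / (c.tp + c.fn)          # sensitivity: share of positives found
    specificity = c.tn / (c.tn + c.fp)     # share of negatives correctly rejected
    precision = c.tp / (c.tp + c.fp)       # share of positive calls that are right
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for an imbalanced test set (10% positives).
# These numbers are NOT the paper's confusion matrix.
example = ConfusionCounts(tp=90, fn=10, fp=80, tn=820)
metrics = binary_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, recall and specificity are both above 0.9 while F1 is about 0.667, since nearly half of the positive calls are false positives. A gap of this kind between F1 and the other metrics is what one would expect whenever diabetic participants are a small fraction of the test set.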
