A Biomarker-Driven and Interpretable Machine Learning Model for Diagnosing Diabetes Mellitus
Authors: Zhihui Xiao, Mingfu Wang, Yueliang Zhao, Hui Wang
Journal: Food Science & Nutrition, vol. 13, no. 5 (published 2025-04-30)
DOI: 10.1002/fsn3.70234
URL: https://onlinelibrary.wiley.com/doi/10.1002/fsn3.70234
Diabetes is one of the leading causes of death and disability worldwide. Developing earlier and more accurate diagnostic methods is crucial for the clinical prevention and treatment of diabetes. Here, data on biochemical indicators and physiological characteristics of 4335 participants were collected from the National Health and Nutrition Examination Survey (NHANES) database for 2017 to 2020. After data preprocessing, the dataset was randomly divided into a training set (70%) and a test set (30%), and the Boruta algorithm was then used to screen feature indicators on the training set. Next, three machine learning algorithms, Random Forest (RF), Multi-Layer Perceptron (MLP), and Extreme Gradient Boosting (XGBoost), were used to build predictive models through 10-fold cross-validation on the training set, followed by performance evaluation on the test set. The RF model performed best, with an area under the curve (AUC) of 0.958 (95% CI: 0.943–0.973), a recall of 0.897, a specificity of 0.916, an F1 score of 0.747, and an overall accuracy of 0.913. Moreover, SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDP) were applied to interpret the RF model and analyze the risk factors for diabetes. Glycohemoglobin, glucose, fasting glucose, age, cholesterol, osmolality, BMI, blood urea nitrogen, and insulin were found to exert the greatest influence on the prevalence of diabetes. Collectively, the RF model has considerable application prospects for the diagnosis of diabetes and can serve as a valuable supplementary tool for clinical diagnosis and risk assessment in diabetes.
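The evaluation metrics quoted in the abstract are all functions of a binary confusion matrix. The following is a minimal sketch of those standard definitions, assuming diabetes is the positive class; the function names and the example counts are hypothetical and are not taken from the study.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positive cases predicted positive/negative.
    tn/fp: negative cases predicted negative/positive.
    """
    recall = tp / (tp + fn)            # sensitivity: share of positives found
    specificity = tn / (tn + fp)       # share of negatives correctly cleared
    precision = tp / (tp + fp)         # share of positive calls that are right
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for the precision P."""
    return f1 * recall / (2 * recall - f1)


# Hypothetical counts for illustration only (NOT the study's confusion matrix).
example = classification_metrics(tp=90, fp=50, tn=850, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")

# Precision implied by the recall and F1 score quoted in the abstract,
# assuming the F1 score refers to the positive (diabetes) class.
print(f"implied precision: {implied_precision(0.747, 0.897):.3f}")
```

If the reported F1 score does refer to the diabetes class, the reported recall (0.897) and F1 score (0.747) together imply a precision of roughly 0.64, a value the abstract does not state directly. An F1 score well below the accuracy is typical when the positive class is the minority.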
About the journal:
Food Science & Nutrition is the peer-reviewed journal for rapid dissemination of research in all areas of food science and nutrition. The Journal will consider submissions of quality papers describing the results of fundamental and applied research related to all aspects of human food and nutrition, as well as interdisciplinary research that spans these two fields.