Development and Validation of Machine Learning Models for Identifying Prediabetes and Diabetes in Normoglycemia

IF 6, Zone 2 (Medicine), JCR Q1: Endocrinology & Metabolism

Diabetes/Metabolism Research and Reviews Pub Date : 2024-11-04 DOI:10.1002/dmrr.70003

Xiaodong Zhang, Weidong Yao, Dawei Wang, Wenqi Hu, Guang Zhang, Yongsheng Zhang



Abstract

Background

Prediabetes and diabetes are both states of abnormal glucose metabolism (AGM) that can lead to severe complications. Early detection of AGM is crucial for timely intervention and treatment. However, fasting blood glucose (FBG), when used as a mass population screening method, may fail to identify individuals who have AGM despite normoglycemia. This study aimed to develop and validate machine learning (ML) models to identify AGM among individuals with normoglycemia using routine health check-up indicators.

Methods

According to the American Diabetes Association (ADA) criteria, participants with normoglycemia (FBG ≤ 5.6 mmol/L) were enrolled from 2019 to 2023 and divided into AGM and Normal groups using a glycosylated haemoglobin (HbA1c) threshold of 5.7%. Data from 2019 to 2022 were divided into training and internal validation sets at a 7:3 ratio, while data from 2023 were used as the external validation set. Seven ML algorithms were used to build models for identifying AGM in the normoglycemic population: logistic regression (LR), random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost), multilayer perceptron (MLP), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost). Model performance was evaluated using the areas under the receiver operating characteristic curve (auROC) and the precision-recall curve (auPR). The feature contributions to the optimal model were visualised using SHapley Additive exPlanations (SHAP). Finally, an intuitive and user-friendly interactive interface was developed.
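The cohort construction described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the record fields (`year`, `fbg`, `hba1c`), the helper names `label_cohort` and `split_cohort`, the random (unstratified) 7:3 split, and the seed are all hypothetical, and an HbA1c value at or above 5.7% is assumed to fall on the AGM side of the threshold.

```python
import random

FBG_MAX = 5.6    # mmol/L, normoglycemia cut-off stated in the Methods
HBA1C_CUT = 5.7  # %, threshold separating the AGM and Normal groups


def label_cohort(records):
    """Keep normoglycemic records and attach a label: 1 = AGM, 0 = Normal."""
    cohort = []
    for r in records:
        if r["fbg"] <= FBG_MAX:
            # Assumption: HbA1c at or above the threshold counts as AGM.
            cohort.append({**r, "agm": int(r["hba1c"] >= HBA1C_CUT)})
    return cohort


def split_cohort(cohort, train_frac=0.7, seed=0):
    """Split 2019-2022 records 7:3; hold out 2023 as the external set."""
    development = [r for r in cohort if 2019 <= r["year"] <= 2022]
    external = [r for r in cohort if r["year"] == 2023]
    # Assumption: the 7:3 division is a simple random shuffle.
    rng = random.Random(seed)
    rng.shuffle(development)
    n_train = round(train_frac * len(development))
    return development[:n_train], development[n_train:], external


# Toy usage with made-up values (not study data).
toy = [
    {"year": 2020, "fbg": 5.1, "hba1c": 5.9},  # normoglycemic, AGM
    {"year": 2021, "fbg": 4.9, "hba1c": 5.3},  # normoglycemic, Normal
    {"year": 2022, "fbg": 6.4, "hba1c": 6.1},  # excluded: FBG above cut-off
    {"year": 2023, "fbg": 5.4, "hba1c": 5.8},  # external validation, AGM
]
train, internal, external = split_cohort(label_cohort(toy))
```

Holding out the most recent year, rather than a random slice, means the external set tests whether a model fitted on earlier check-ups still works on later ones.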

Results

A total of 59,259 participants were enrolled in this study and divided into a training set of 32,810, an internal validation set of 14,060, and an external validation set of 12,389. The CatBoost model outperformed the others, with auROC values of 0.806 and 0.794 for the internal and external validation sets, respectively. Age was the most important feature influencing the performance of the CatBoost model, followed by fasting blood glucose, red blood cells, haemoglobin, body mass index, and the triglyceride-glucose index.
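For readers unfamiliar with the two reported metrics, the sketch below shows one way to compute them from labels and predicted scores. It is an illustration only: the function names are hypothetical, the auROC uses the pairwise (Mann-Whitney) definition, and the auPR is estimated as average precision, which is one common estimator and not necessarily the one used in the study.

```python
def au_roc(y_true, scores):
    """auROC: chance that a random positive scores above a random negative.

    Ties count as half a win. The pairwise loop is quadratic, which is
    fine for a sketch but slow for cohorts of tens of thousands.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Assumes distinct scores; tied scores would need to be grouped.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    true_pos, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


# Toy example with made-up labels and scores (not study data).
labels = [0, 0, 1, 1]
preds = [0.1, 0.4, 0.35, 0.8]
print(au_roc(labels, preds))             # 0.75
print(average_precision(labels, preds))  # about 0.833
```

When the positive class is a minority, the auPR is usually the more informative of the two because it ignores true negatives, which is a likely reason both metrics are reported.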

Conclusion

A well-performing ML model for identifying AGM in the normoglycemic population was built, offering significant potential for early intervention in and treatment of AGM cases that would otherwise have been missed.


Source journal: Diabetes/Metabolism Research and Reviews (Medicine: Endocrinology and Metabolism)

CiteScore: 17.20

Self-citation rate: 2.50%

Review time: 4-8 weeks

Journal description: Diabetes/Metabolism Research and Reviews is a premier endocrinology and metabolism journal esteemed by clinicians and researchers alike. Encompassing a wide spectrum of topics including diabetes, endocrinology, metabolism, and obesity, the journal eagerly accepts submissions ranging from clinical studies to basic and translational research, as well as reviews exploring historical progress, controversial issues, and prominent opinions in the field. Join us in advancing knowledge and understanding in the realm of diabetes and metabolism.