Prediction of the risk of developing end-stage renal diseases in newly diagnosed type 2 diabetes mellitus using artificial intelligence algorithms.

IF 4 3区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining Pub Date : 2023-03-10 DOI:10.1186/s13040-023-00324-2

Shuo-Ming Ou, Ming-Tsun Tsai, Kuo-Hua Lee, Wei-Cheng Tseng, Chih-Yu Yang, Tz-Heng Chen, Pin-Jie Bin, Tzeng-Ji Chen, Yao-Ping Lin, Wayne Huey-Herng Sheu, Yuan-Chia Chu, Der-Cherng Tarng

{"title":"Prediction of the risk of developing end-stage renal diseases in newly diagnosed type 2 diabetes mellitus using artificial intelligence algorithms.","authors":"Shuo-Ming Ou, Ming-Tsun Tsai, Kuo-Hua Lee, Wei-Cheng Tseng, Chih-Yu Yang, Tz-Heng Chen, Pin-Jie Bin, Tzeng-Ji Chen, Yao-Ping Lin, Wayne Huey-Herng Sheu, Yuan-Chia Chu, Der-Cherng Tarng","doi":"10.1186/s13040-023-00324-2","DOIUrl":null,"url":null,"abstract":"Objectives: Type 2 diabetes mellitus (T2DM) imposes a great burden on healthcare systems, and these patients experience higher long-term risks for developing end-stage renal disease (ESRD). Managing diabetic nephropathy becomes more challenging when kidney function starts declining. Therefore, developing predictive models for the risk of developing ESRD in newly diagnosed T2DM patients may be helpful in clinical settings.Methods: We established machine learning models constructed from a subset of clinical features collected from 53,477 newly diagnosed T2DM patients from January 2008 to December 2018 and then selected the best model. The cohort was divided, with 70% and 30% of patients randomly assigned to the training and testing sets, respectively.Results: The discriminative ability of our machine learning models, including logistic regression, extra tree classifier, random forest, gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine were evaluated across the cohort. XGBoost yielded the highest area under the receiver operating characteristic curve (AUC) of 0.953, followed by extra tree and GBDT, with AUC values of 0.952 and 0.938 on the testing dataset. The SHapley Additive explanation summary plot in the XGBoost model illustrated that the top five important features included baseline serum creatinine, mean serum creatine within 1 year before the diagnosis of T2DM, high-sensitivity C-reactive protein, spot urine protein-to-creatinine ratio and female gender.Conclusions: Because our machine learning prediction models were based on routinely collected clinical features, they can be used as risk assessment tools for developing ESRD. By identifying high-risk patients, intervention strategies may be provided at an early stage.","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"8"},"PeriodicalIF":4.0000,"publicationDate":"2023-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10007785/pdf/","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodata Mining","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13040-023-00324-2","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 5

Abstract

Objectives: Type 2 diabetes mellitus (T2DM) imposes a great burden on healthcare systems, and these patients experience higher long-term risks for developing end-stage renal disease (ESRD). Managing diabetic nephropathy becomes more challenging when kidney function starts declining. Therefore, developing predictive models for the risk of developing ESRD in newly diagnosed T2DM patients may be helpful in clinical settings.

Methods: We established machine learning models constructed from a subset of clinical features collected from 53,477 newly diagnosed T2DM patients from January 2008 to December 2018 and then selected the best model. The cohort was divided, with 70% and 30% of patients randomly assigned to the training and testing sets, respectively.

Results: The discriminative ability of our machine learning models, including logistic regression, extra tree classifier, random forest, gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine were evaluated across the cohort. XGBoost yielded the highest area under the receiver operating characteristic curve (AUC) of 0.953, followed by extra tree and GBDT, with AUC values of 0.952 and 0.938 on the testing dataset. The SHapley Additive explanation summary plot in the XGBoost model illustrated that the top five important features included baseline serum creatinine, mean serum creatine within 1 year before the diagnosis of T2DM, high-sensitivity C-reactive protein, spot urine protein-to-creatinine ratio and female gender.

Conclusions: Because our machine learning prediction models were based on routinely collected clinical features, they can be used as risk assessment tools for developing ESRD. By identifying high-risk patients, intervention strategies may be provided at an early stage.

Abstract Image

查看原文本刊更多论文

用人工智能算法预测新诊断的2型糖尿病患者发生终末期肾脏疾病的风险

目的:2型糖尿病(T2DM)给医疗系统带来了巨大的负担，这些患者发展为终末期肾脏疾病(ESRD)的长期风险更高。当肾功能开始下降时，糖尿病肾病的治疗变得更具挑战性。因此，开发新诊断T2DM患者发生ESRD风险的预测模型可能有助于临床设置。方法:从2008年1月至2018年12月收集的53,477例新诊断T2DM患者的临床特征子集中建立机器学习模型，然后选择最佳模型。队列被划分，70%和30%的患者被随机分配到训练组和测试组。结果:我们的机器学习模型，包括逻辑回归、额外树分类器、随机森林、梯度增强决策树(GBDT)、极端梯度增强(XGBoost)和轻梯度增强机的判别能力在整个队列中进行了评估。XGBoost在接收者工作特征曲线(AUC)下的面积最高，为0.953，其次是extra tree和GBDT，在测试数据集上的AUC值分别为0.952和0.938。XGBoost模型中的SHapley Additive解释总结图显示，最重要的5个特征包括基线血清肌酐、诊断T2DM前1年内平均血清肌酐、高敏c反应蛋白、尿样蛋白/肌酐比和女性性别。结论:由于我们的机器学习预测模型是基于常规收集的临床特征，因此它们可以作为发生ESRD的风险评估工具。通过识别高危患者，可以在早期阶段提供干预策略。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Biodata Mining MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

7.90

自引率

0.00%

发文量

审稿时长

23 weeks

期刊介绍： BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to: -Development, evaluation, and application of novel data mining and machine learning algorithms. -Adaptation, evaluation, and application of traditional data mining and machine learning algorithms. -Open-source software for the application of data mining and machine learning algorithms. -Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies. -Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.