Machine learning-based reproducible prediction of type 2 diabetes subtypes.

IF 8.4 · CAS Zone 1 (Medicine) · JCR Q1, Endocrinology & Metabolism

Diabetologia · Pub Date: 2024-11-01 · Epub Date: 2024-08-21 · DOI: 10.1007/s00125-024-06248-8

Hayato Tanabe, Masahiro Sato, Akimitsu Miyake, Yoshinori Shimajiri, Takafumi Ojima, Akira Narita, Haruka Saito, Kenichi Tanaka, Hiroaki Masuzaki, Junichiro J Kazama, Hideki Katagiri, Gen Tamiya, Eiryo Kawakami, Michio Shimabukuro

Diabetologia, pages 2446-2458. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11519166/pdf/

Citations: 0

Abstract

Aims/hypothesis: Clustering-based subclassification of type 2 diabetes, which reflects pathophysiology and genetic predisposition, is a promising approach for providing personalised and effective therapeutic strategies. Ahlqvist's classification is currently the most vigorously validated method because of its superior ability to predict diabetes complications but it does not have strong consistency over time and requires HOMA2 indices, which are not routinely available in clinical practice and standard cohort studies. We developed a machine learning (ML) model to classify individuals with type 2 diabetes into Ahlqvist's subtypes consistently over time.

Methods: Cohort 1 dataset comprised 619 Japanese individuals with type 2 diabetes who were divided into training and test sets for ML models in a 7:3 ratio. Cohort 2 dataset, comprising 597 individuals with type 2 diabetes, was used for external validation. Participants were pre-labelled (T2D_kmeans) by unsupervised k-means clustering based on Ahlqvist's variables (age at diagnosis, BMI, HbA1c, HOMA2-B and HOMA2-IR) to four subtypes: severe insulin-deficient diabetes (SIDD), severe insulin-resistant diabetes (SIRD), mild obesity-related diabetes (MOD) and mild age-related diabetes (MARD). We adopted 15 variables for a multiclass classification random forest (RF) algorithm to predict type 2 diabetes subtypes (T2D_RF15). The proximity matrix computed by RF was visualised using uniform manifold approximation and projection (UMAP). Finally, we used a putative subset with missing insulin-related variables to test the predictive performance of the validation cohort, consistency of subtypes over time and prediction ability of diabetes complications.
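As an illustration of the proximity matrix mentioned in the Methods, the following is a minimal sketch, not the authors' implementation. It assumes the usual definition of random forest proximity: the fraction of trees in which two samples fall into the same terminal leaf. The function name `proximity_matrix` and the toy leaf assignments are hypothetical; in scikit-learn, leaf indices of this shape can be obtained from `RandomForestClassifier.apply(X)`.

```python
from typing import List, Sequence


def proximity_matrix(leaf_ids: Sequence[Sequence[int]]) -> List[List[float]]:
    """Return the sample-by-sample random forest proximity matrix.

    leaf_ids[i][t] is the index of the leaf that sample i reaches in tree t.
    Proximity of samples i and j is the share of trees where both reach
    the same leaf.
    """
    n_samples = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i, n_samples):
            shared = sum(
                1 for t in range(n_trees) if leaf_ids[i][t] == leaf_ids[j][t]
            )
            # The matrix is symmetric, so fill both halves at once.
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Hypothetical leaf assignments for 4 samples across 3 trees.
leaves = [
    [0, 2, 1],
    [0, 2, 3],
    [1, 2, 3],
    [4, 5, 6],
]
prox = proximity_matrix(leaves)
```

One common choice, assumed here rather than taken from the abstract, is to pass `1 - proximity` as a precomputed distance to the embedding step so that samples sharing many leaves are drawn close together.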

Results: T2D_RF15 demonstrated a 94% accuracy for predicting T2D_kmeans type 2 diabetes subtypes (AUCs ≥0.99 and F1 score [an indicator calculated by harmonic mean from precision and recall] ≥0.9) and retained the predictive performance in the external validation cohort (86.3%). T2D_RF15 showed an accuracy of 82.9% for detecting T2D_kmeans, also in a putative subset with missing insulin-related variables, when used with an imputation algorithm. In Kaplan-Meier analysis, the diabetes clusters of T2D_RF15 demonstrated distinct accumulation risks of diabetic retinopathy in SIDD and that of chronic kidney disease in SIRD during a median observation period of 11.6 (4.5-18.3) years, similarly to the subtypes using T2D_kmeans. The predictive accuracy was improved after excluding individuals with low predictive probability, who were categorised as an 'undecidable' cluster. T2D_RF15, after excluding undecidable individuals, showed higher consistency (100% for SIDD, 68.6% for SIRD, 94.4% for MOD and 97.9% for MARD) than T2D_kmeans.
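Two quantities in the Results can be made concrete with a short sketch: the F1 score as the harmonic mean of precision and recall, and the 'undecidable' cluster for individuals with low predictive probability. The abstract does not state the probability cut-off, so the `threshold=0.6` default below is purely an assumption, as are the function names and the example probabilities.

```python
from typing import Dict


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def assign_subtype(probs: Dict[str, float], threshold: float = 0.6) -> str:
    """Pick the most probable subtype, or 'undecidable' below the cut-off.

    probs maps each subtype (SIDD, SIRD, MOD, MARD) to its predicted
    class probability. The threshold is a hypothetical value, not one
    reported in the abstract.
    """
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else "undecidable"


# Hypothetical predicted probabilities for two individuals.
confident = {"SIDD": 0.85, "SIRD": 0.05, "MOD": 0.05, "MARD": 0.05}
ambiguous = {"SIDD": 0.30, "SIRD": 0.30, "MOD": 0.25, "MARD": 0.15}

label_confident = assign_subtype(confident)
label_ambiguous = assign_subtype(ambiguous)
```

Raising the threshold moves more individuals into the undecidable group, which trades coverage for accuracy among those who are still classified.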

Conclusions/interpretation: The new ML model for predicting Ahlqvist's subtypes of type 2 diabetes has great potential for application in clinical practice and cohort studies because it can classify individuals with missing HOMA2 indices and predict glycaemic control, diabetic complications and treatment outcomes with long-term consistency by using readily available variables. Future studies are needed to assess whether our approach is applicable to research and/or clinical practice in multiethnic populations.

Source journal

Diabetologia (Medicine: Endocrinology & Metabolism)

CiteScore: 18.10

Self-citation rate: 2.40%

Articles published: 193

Review time: 1 month

About the journal: Diabetologia, the authoritative journal dedicated to diabetes research, holds high visibility through society membership, libraries, and social media. As the official journal of the European Association for the Study of Diabetes, it is ranked in the top quartile of the 2019 JCR Impact Factors in the Endocrinology & Metabolism category. The journal boasts dedicated and expert editorial teams committed to supporting authors throughout the peer review process.