Comparing Explainable Machine Learning Approaches With Traditional Statistical Methods for Evaluating Stroke Risk Models: Retrospective Cohort Study.

Q2 Medicine
JMIR Cardio Pub Date : 2023-07-26 DOI:10.2196/47736
Sermkiat Lolak, John Attia, Gareth J McKay, Ammarin Thakkinstian
JMIR Cardio 2023;7:e47736. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10413234/pdf/
Citations: 1

Abstract

Background: Stroke has multiple modifiable and nonmodifiable risk factors and represents a leading cause of death globally. Understanding the complex interplay of stroke risk factors is thus not only a scientific necessity but a critical step toward improving global health outcomes.

Objective: We aimed to compare explainable machine learning models with conventional statistical methods for predicting stroke and identifying its risk factors, using real-world cohort data.

Methods: This retrospective cohort included high-risk patients from Ramathibodi Hospital in Thailand between January 2010 and December 2020. We compared the performance and explainability of logistic regression (LR), Cox proportional hazards, Bayesian network (BN), tree-augmented Naïve Bayes (TAN), extreme gradient boosting (XGBoost), and explainable boosting machine (EBM) models. Missing data were handled with multiple imputation by chained equations, and continuous variables were discretized as needed. Models were evaluated using C-statistics and F1-scores.
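
The two evaluation metrics named in the Methods can be illustrated with a minimal sketch. For a binary outcome, the C-statistic is the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, and the F1-score is the harmonic mean of precision and recall. The toy labels, the scores, and the 0.5 decision threshold below are assumptions chosen for illustration; they are not the study's data or its threshold.

```python
from itertools import product


def c_statistic(y_true, y_score):
    """Concordance for a binary outcome: share of (case, non-case) pairs
    in which the case has the higher score; ties count as 0.5."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c, n in product(cases, controls)
    )
    return concordant / (len(cases) * len(controls))


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data (1 = stroke, 0 = no stroke); not from the study.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.60, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 threshold

print(c_statistic(y_true, y_score))  # 0.8125
print(f1_score(y_true, y_pred))      # 0.75
```

The C-statistic depends only on how the scores rank patients, whereas the F1-score depends on the chosen threshold, so the two metrics can rank models differently.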

Results: Of 275,247 high-risk patients, 9659 (3.5%) experienced a stroke. XGBoost performed best, with a C-statistic of 0.89 and an F1-score of 0.80, followed by EBM and TAN with C-statistics of 0.87 and 0.83, respectively; LR and BN had similar C-statistics of 0.80. Significant factors associated with stroke included atrial fibrillation (AF), hypertension (HT), antiplatelet medication, high-density lipoprotein (HDL), and age. AF, HT, and antihypertensive medication were common significant factors across most models, with AF being the strongest factor in the LR, XGBoost, BN, and TAN models.
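
The reported event rate can be checked directly from the two counts in the Results. The sketch below also works through a hypothetical baseline that predicts "no stroke" for every patient. This baseline is an illustration only and is not a model from the study; it shows why a threshold-based metric such as the F1-score is more informative than raw accuracy when only about 3.5% of patients have the outcome.

```python
n_patients = 275_247  # high-risk patients (from the abstract)
n_strokes = 9_659     # patients who experienced a stroke (from the abstract)

prevalence = n_strokes / n_patients

# Hypothetical baseline: predict "no stroke" for everyone.
tp, fp = 0, 0
fn = n_strokes
tn = n_patients - n_strokes

baseline_accuracy = (tp + tn) / n_patients
denom = 2 * tp + fp + fn
baseline_f1 = 2 * tp / denom if denom else 0.0

print(f"prevalence:        {prevalence:.1%}")         # 3.5%
print(f"baseline accuracy: {baseline_accuracy:.1%}")  # 96.5%
print(f"baseline F1-score: {baseline_f1:.2f}")        # 0.00
```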

Conclusions: Our study developed stroke prediction models and identified crucial predictive factors in high-risk patients: AF; HT, systolic blood pressure, or antihypertensive medication; anticoagulant medication; HDL; age; and statin use. The explainable XGBoost model was the best at predicting stroke risk, followed by EBM.

Source journal: JMIR Cardio (Computer Science: Computer Science Applications)
CiteScore: 3.50
Self-citation rate: 0.00%
Articles published: 25
Review time: 12 weeks