An explainable machine learning model for predicting chronic coronary disease and identifying valuable text features.

IF 2.8 | Zone 3, Medicine | JCR Q2, Cardiac & Cardiovascular Systems
Frontiers in Cardiovascular Medicine Pub Date : 2025-09-22 eCollection Date: 2025-01-01 DOI:10.3389/fcvm.2025.1559831
Weipeng Gan, Peipei Wang, Xiangrong Xie, Lingfei Yang, Dasheng Lu, Sheng Ye, Mingquan Ye
Volume 12, article 1559831. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12497772/pdf/

Abstract

Background: Chronic coronary disease (CCD) is a leading cause of morbidity and mortality worldwide. Existing pre-test probability (PTP) models rely mainly on in-hospital data and clinician judgment. This study aimed to build machine learning (ML) models that predict CCD from easily accessible text data and baseline characteristics, and to evaluate how much the text data contribute to the diagnostic model.

Methods: We collected the chief complaints, history of present illness, past medical history, and vital signs of patients from the internal medicine departments of the First Affiliated Hospital and the Second Affiliated Hospital of Wannan Medical College. Text mining was used to convert the free-text data into structured features. To improve text processing, a customized stop-word list and a custom dictionary for cardiovascular medicine were created. ML algorithms were then used to build CCD prediction models, and the SHapley Additive exPlanations (SHAP) algorithm was used to interpret them.
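The text-structuring step described in the Methods can be illustrated with a minimal sketch. It assumes English phrases for readability; the study's actual dictionary, stop-word list, and tokenizer are not given in the abstract, so `CUSTOM_DICTIONARY`, `STOP_WORDS`, and every function name below are hypothetical.

```python
import re

# Hypothetical vocabulary: the study's real dictionary and stop-word list are not published here.
CUSTOM_DICTIONARY = ["chest pain", "chest tightness", "shortness of breath", "hypertension"]
STOP_WORDS = {"the", "a", "of", "for", "and", "with", "patient", "reports", "days"}


def tokenize(text, dictionary=CUSTOM_DICTIONARY, stop_words=STOP_WORDS):
    """Split free text into tokens, keeping dictionary phrases whole and dropping stop words."""
    text = text.lower()
    # Protect multi-word dictionary terms by joining them with underscores (longest first).
    for term in sorted(dictionary, key=len, reverse=True):
        text = text.replace(term, term.replace(" ", "_"))
    tokens = re.findall(r"[a-z_]+", text)
    return [t.replace("_", " ") for t in tokens if t not in stop_words]


def build_vocabulary(documents):
    """Collect every distinct token seen in the corpus, in sorted order."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})


def vectorize(text, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary term occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocabulary]


notes = [
    "Patient reports chest pain and chest tightness for 3 days",
    "History of hypertension with shortness of breath",
]
vocabulary = build_vocabulary(notes)
features = [vectorize(note, vocabulary) for note in notes]
```

The custom dictionary keeps clinical phrases such as "chest pain" as single features instead of splitting them into unrelated words, and the stop-word list removes terms that carry no diagnostic signal. The resulting vectors could then be combined with structured fields such as age and passed to a classifier.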

Results: We enrolled 21,855 patients: 7,449 in the CCD group and 14,406 in the non-CCD group. Patients in the CCD group were generally older, and a higher proportion were male. After feature engineering, we constructed a Random Forest model that achieved an area under the ROC curve (AUC) of 0.93 (95% CI, 0.93-0.94), demonstrating excellent performance in comparison with the other models evaluated. SHAP analysis identified text features such as "chest pain" and "chest tightness", together with structured features such as age, as crucial for CCD judgment, and illustrated how these features influenced the model's decisions.
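To make the reported metric concrete, the following sketch computes an AUC and a percentile-bootstrap 95% confidence interval in plain Python. The abstract does not say how the interval was obtained, so the bootstrap here is an assumption, and `roc_auc` and `bootstrap_ci` are illustrative names rather than the authors' code.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, not stated in the abstract)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_resamples)]
    upper = aucs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data for illustration only; these are not study results.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 0.75: three of the four positive/negative pairs are ranked correctly
low, high = bootstrap_ci(labels, scores, n_resamples=200)
```

The pairwise comparison is quadratic in the number of samples, which is fine for a toy example; on a cohort of the size reported here, a rank-based implementation would be the more practical choice.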

Conclusion: Clinicians can use text data to build a prediction model for CCD and apply SHAP to identify valuable text features and explain the model's decision-making mechanism.
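The idea behind SHAP-style attribution can be shown with exact Shapley values on a toy model. This is a brute-force sketch over all feature coalitions, not the study's pipeline: `toy_model`, its coefficients, and the baseline patient are hypothetical, and features outside a coalition are simply set to their baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features absent from a coalition take their baseline value."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in names}
        return predict(x)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain feature f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_model(x):
    """Hypothetical linear risk score; NOT the Random Forest from the study."""
    return 0.1 + 0.3 * x["chest pain"] + 0.2 * x["chest tightness"] + 0.004 * (x["age"] - 50)


patient = {"chest pain": 1, "chest tightness": 0, "age": 70}
baseline = {"chest pain": 0, "chest tightness": 0, "age": 50}
phi = shapley_values(toy_model, patient, baseline)
```

For this toy patient the attributions are approximately 0.3 for "chest pain", 0.0 for "chest tightness", and 0.08 for age, and they sum to the difference between the patient's score and the baseline score. That additivity is what lets a SHAP plot decompose a single prediction into per-feature contributions. Enumerating coalitions grows exponentially with the number of features, so real analyses rely on faster algorithms.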
