Weipeng Gan, Peipei Wang, Xiangrong Xie, Lingfei Yang, Dasheng Lu, Sheng Ye, Mingquan Ye
An explainable machine learning model for predicting chronic coronary disease and identifying valuable text features.
Frontiers in Cardiovascular Medicine, vol. 12, article 1559831. Published 2025-09-22. DOI: https://doi.org/10.3389/fcvm.2025.1559831
Abstract
Background: Chronic Coronary Disease (CCD) is a leading global cause of morbidity and mortality. Existing Pre-test Probability (PTP) models mainly rely on in-hospital data and clinician judgment. This study aims to construct machine learning (ML) models for predicting CCD by using easily accessible text data and baseline characteristics, and to evaluate the contribution of text data to the diagnostic model.
Methods: We collected the chief complaints, history of present illness, past medical history and vital signs of patients from the internal medicine departments of the First Affiliated Hospital and the Second Affiliated Hospital of Wannan Medical College. The free-text data were converted into structured features using text mining. To improve text processing, we created a customized stop-word list and a custom dictionary for cardiovascular medicine. ML algorithms were then used to build CCD prediction models. Finally, the SHapley Additive exPlanations (SHAP) algorithm was used to interpret the models.
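To make the text-structuring step in the Methods concrete, the following is a minimal sketch of how a custom dictionary and a stop-word list can turn a free-text note into binary features. The abstract does not name the tokenizer or list the dictionary entries, so everything here is an assumption for illustration: the English example note, the term lists, the greedy longest-match strategy, and the function names `tokenize` and `to_features` are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical domain dictionary: multi-word terms that must stay as one token.
CUSTOM_DICTIONARY: Set[str] = {"chest pain", "chest tightness", "shortness of breath"}
# Hypothetical stop words: tokens assumed to carry no diagnostic signal.
STOP_WORDS: Set[str] = {"the", "a", "of", "for", "and", "with", "patient", "reports", "days"}
MAX_TERM_WORDS = max(len(term.split()) for term in CUSTOM_DICTIONARY)


def tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenization against the custom dictionary."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    words = [w for w in words if w]
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shrink to a single word.
        for span in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + span])
            if span == 1 or candidate in CUSTOM_DICTIONARY:
                tokens.append(candidate)
                i += span
                break
    # Drop stop words only after dictionary terms have been matched.
    return [t for t in tokens if t not in STOP_WORDS]


def to_features(tokens: List[str], vocabulary: List[str]) -> Dict[str, int]:
    """One binary presence/absence feature per vocabulary term."""
    present = set(tokens)
    return {term: int(term in present) for term in vocabulary}


note = "Patient reports chest pain and chest tightness for 3 days."
tokens = tokenize(note)
features = to_features(tokens, ["chest pain", "chest tightness", "palpitations"])
print(tokens)
print(features)
```

Matching dictionary terms before removing stop words keeps phrases such as "shortness of breath" intact even though "of" is itself a stop word. The resulting feature dictionaries could then be combined with structured variables such as age and vital signs as input to a classifier.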
Results: We enrolled 21,855 patients, 7,449 in the CCD group and 14,406 in the non-CCD group. Patients in the CCD group were generally older, and a higher proportion were male. After feature engineering, we constructed a Random Forest model that achieved an area under the ROC curve (AUC) of 0.93 (95% CI, 0.93-0.94) and performed well in comparison with the other models evaluated. Using the SHAP algorithm, we identified features that are crucial for CCD judgment, including text features such as "chest pain" and "chest tightness" and structured features such as age. We also illustrated how these features influenced the model's decisions.
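As an illustration of the kind of metric reported in the Results, the sketch below computes an AUC from its rank-based definition and attaches a percentile bootstrap confidence interval. The abstract does not say how its 95% CI was obtained, so the bootstrap is an assumption, and the four labels and scores are made-up toy data, not study data.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    n_boot: int = 500,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the AUC (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contained a single class; draw again
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Toy data: 1 = CCD, 0 = non-CCD; scores are hypothetical model probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
point = auc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUC = {point:.2f} (95% CI, {low:.2f}-{high:.2f})")
```

With only four toy observations the interval is very wide; a narrow interval like the one reported in the abstract is what one would expect from a cohort of many thousands of patients.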
Conclusion: Clinicians can leverage text data to construct a prediction model for CCD and apply the SHAP approach to pinpoint valuable text features and elucidate the model's decision-making mechanism.
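The idea behind SHAP attributions can be shown with exact Shapley values on a tiny example. This sketch is not the SHAP library and not the study's Random Forest: `toy_risk` is a hypothetical scoring function with invented weights, and the exact enumeration over feature subsets used here is only feasible for a handful of features (SHAP uses faster model-specific algorithms).

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    features: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley value of each feature by enumerating all subsets of the others."""
    n = len(features)
    phi: Dict[str, float] = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(present: FrozenSet[str]) -> float:
    """Hypothetical risk score; all weights are invented for illustration."""
    score = 0.1  # baseline risk with no features present
    if "chest pain" in present:
        score += 0.3
    if "older age" in present:
        score += 0.2
    if "chest pain" in present and "chest tightness" in present:
        score += 0.1  # interaction term shared between the two symptoms
    return score


names = ["chest pain", "chest tightness", "older age"]
phi = shapley_values(names, toy_risk)
for name in names:
    print(f"{name}: {phi[name]:+.3f}")
```

The attributions sum to the difference between the score with all features present and the baseline score, which is the additivity property that lets a SHAP plot decompose a single prediction into per-feature contributions.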
About the journal:
Frontiers? Which frontiers? Where exactly are the frontiers of cardiovascular medicine? And who should be defining these frontiers?
At Frontiers in Cardiovascular Medicine we believe it is worth being curious to foresee and explore beyond the current frontiers. In other words, we would like, through the articles published by our community journal Frontiers in Cardiovascular Medicine, to anticipate the future of cardiovascular medicine, and thus better prevent cardiovascular disorders and improve therapeutic options and outcomes of our patients.