Thien Vu, Yoshihiro Kokubo, Mai Inoue, Masaki Yamamoto, Attayeb Mohsen, Agustin Martin-Morales, Research Dawadi, Takao Inoue, Jie Ting Tay, Mari Yoshizaki, Naoki Watanabe, Yuki Kuriya, Chisa Matsumoto, Ahmed Arafa, Yoko M Nakao, Yuka Kato, Masayuki Teramoto, Michihiro Araki
JMIR Cardio, volume 9, e68066. Published 2025-05-12. DOI: 10.2196/68066 (https://doi.org/10.2196/68066)
Abstract
Machine Learning Model for Predicting Coronary Heart Disease Risk: Development and Validation Using Insights From a Japanese Population-Based Study.
Background: Coronary heart disease (CHD) is a major cause of morbidity and mortality worldwide. Identifying key risk factors is essential for effective risk assessment and prevention. Data-driven approaches using machine learning (ML) can analyze complex, nonlinear, and high-dimensional datasets and may uncover novel predictors of CHD, going beyond traditional models that rely on predefined variables.
Objective: This study aims to use ML techniques to evaluate the contribution of various risk factors to CHD, focusing on both established and novel markers.
Methods: The study recruited 7672 participants aged 30-84 years from Suita City, Japan, between 1989 and 1999. Participants were followed for cardiovascular events over an average of 15 years. After excluding individuals with missing outcome data and eliminating unnecessary variables, 7260 participants and 28 variables were included in the analysis. Five ML models (logistic regression, random forest [RF], support vector machine, Extreme Gradient Boosting, and Light Gradient-Boosting Machine) were applied to predict CHD incidence. Model performance was evaluated using accuracy, sensitivity, specificity, precision, area under the curve, F1-score, calibration curves, observed-to-expected ratios, and decision curve analysis. Additionally, SHapley Additive exPlanations (SHAP) were used to interpret the prediction models and understand the contribution of various risk factors to CHD.
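As an illustration of the threshold-based metrics listed in the Methods, the following minimal Python sketch computes accuracy, sensitivity, specificity, precision, F1-score, and an observed-to-expected ratio. It is not the authors' code; the class and function names and all counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for a binary CHD classifier (hypothetical)."""
    tp: int  # CHD cases predicted as CHD
    fp: int  # non-cases predicted as CHD
    tn: int  # non-cases predicted as non-CHD
    fn: int  # CHD cases predicted as non-CHD


def classification_metrics(c: Confusion) -> dict:
    """Return the standard metrics derived from confusion-matrix counts."""
    total = c.tp + c.fp + c.tn + c.fn
    sensitivity = c.tp / (c.tp + c.fn)  # recall among true cases
    specificity = c.tn / (c.tn + c.fp)  # recall among true non-cases
    precision = c.tp / (c.tp + c.fp)    # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def observed_to_expected(outcomes, predicted_probs) -> float:
    """Observed events divided by the sum of predicted probabilities.

    A ratio near 1 indicates that the model predicts, on average,
    about as many events as actually occurred.
    """
    return sum(outcomes) / sum(predicted_probs)


# Hypothetical counts, not taken from the study.
example = classification_metrics(Confusion(tp=30, fp=40, tn=120, fn=10))
oe_ratio = observed_to_expected([1, 0, 0, 1], [0.5, 0.25, 0.25, 1.0])
```

With a rare outcome, precision can be much lower than sensitivity and specificity, which is one reason to report several metrics together rather than accuracy alone.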
Results: Among 7260 participants, 305 (4.2%) were diagnosed with CHD. The RF model demonstrated the highest performance, with an accuracy of 0.73 (95% CI 0.64-0.80), sensitivity of 0.74 (95% CI 0.62-0.84), specificity of 0.72 (95% CI 0.61-0.83), and an area under the curve of 0.73 (95% CI 0.65-0.80). RF also showed excellent calibration, with predicted probabilities closely aligning with observed outcomes, and provided substantial net benefit across a range of risk thresholds, as demonstrated by decision curve analysis. SHAP analysis identified key predictors of CHD, including the intima-media thickness (IMT_cMax) of the common carotid artery, blood pressure, lipid profiles (non-high-density lipoprotein cholesterol, high-density lipoprotein cholesterol, and triglycerides), and estimated glomerular filtration rate. Novel risk factors identified as significant contributors to CHD risk included lower calcium levels, elevated white blood cell counts, and body fat percentage. Furthermore, a protective effect was observed in women, suggesting a possible need for gender-specific risk assessment strategies in future cardiovascular health evaluations.
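The net benefit reported by decision curve analysis can be sketched with the standard formula, net benefit = TP/n - (FP/n) * pt/(1 - pt), where pt is the risk threshold. The Python below is a minimal sketch, not the authors' implementation; the function names and the toy labels and probabilities are hypothetical. Only the cohort prevalence (305 of 7260) comes from the Results.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of the reference strategy that treats every participant."""
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data (hypothetical): two CHD cases among five participants.
toy_labels = [1, 0, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.6, 0.4, 0.1]
toy_nb = net_benefit(toy_labels, toy_probs, threshold=0.5)

# Reference curve at the cohort prevalence reported in the Results.
cohort_prevalence = 305 / 7260
treat_all_low = net_benefit_treat_all(cohort_prevalence, threshold=0.02)
treat_all_high = net_benefit_treat_all(cohort_prevalence, threshold=0.05)
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all value and zero (the treat-none strategy). At this prevalence the treat-all strategy already has negative net benefit at a 5% threshold.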
Conclusions: We developed a model to predict CHD using ML and applied SHAP methods for interpretation. This approach highlights the multifactorial nature of CHD risk evaluation and is intended to support health care professionals in identifying risk factors and formulating effective prevention strategies.
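To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a small toy risk function by enumerating feature subsets. It is an assumption-laden illustration, not the study's pipeline and not the algorithm used by any particular SHAP library. The feature names (imt, sbp, hdl), the baseline, and all effect sizes are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating all subsets (small feature sets only)."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # size of the coalition that excludes f
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when added to coalition s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_risk(known):
    """Hypothetical predicted risk when only the features in `known` are used."""
    risk = 0.04  # baseline risk (hypothetical)
    if "imt" in known:
        risk += 0.10
    if "sbp" in known:
        risk += 0.05
    if "hdl" in known:
        risk -= 0.03  # protective contribution
    if "imt" in known and "sbp" in known:
        risk += 0.02  # interaction, split equally between imt and sbp
    return risk


all_features = ["imt", "sbp", "hdl"]
phi = shapley_values(toy_risk, all_features)
```

The contributions sum to the difference between the full prediction and the baseline, which is the additivity property that lets SHAP attribute an individual prediction to its risk factors. Exact enumeration grows exponentially with the number of features, so practical tools rely on model-specific or sampling-based approximations.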