The Impact of Time Horizon on Classification Accuracy: Application of Machine Learning to Prediction of Incident Coronary Heart Disease.

JMIR Cardio (Q2, Medicine) | Pub Date: 2022-11-02 | DOI: 10.2196/38040
Steven Simon, Divneet Mandair, Abdel Albakri, Alison Fohner, Noah Simon, Leslie Lange, Mary Biggs, Kenneth Mukamal, Bruce Psaty, Michael Rosenberg
Citations: 1

Abstract

Background: Many machine learning approaches are limited to classification of outcomes rather than longitudinal prediction. One strategy for using machine learning in clinical risk prediction is to classify outcomes over a given time horizon. However, it is not well known how to identify the optimal time horizon for risk prediction.

Objective: In this study, we aimed to identify an optimal time horizon for classification of incident myocardial infarction (MI) using machine learning approaches looped over outcomes with increasing time horizons. Additionally, we sought to compare the performance of these models with the traditional Framingham Heart Study (FHS) coronary heart disease gender-specific Cox proportional hazards regression model.
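
For context, a Cox proportional hazards model of the kind referenced above is usually written as follows. This is the general textbook form, not the specific FHS coefficients; the symbols $h_0$, $S_0$, $\beta$, and $\bar{x}$ are notation introduced here, and the abstract does not state how the baseline survival was obtained for each time horizon.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\hat{p}(t \mid x) = 1 - S_0(t)^{\exp\left(\beta^{\top} x - \beta^{\top} \bar{x}\right)}
```

Here $\hat{p}(t \mid x)$ is the predicted probability of an event by time $t$ for covariates $x$, $S_0(t)$ is the baseline survival evaluated at the covariate means $\bar{x}$, and gender-specific functions use a separate $\beta$ and $S_0$ for women and for men.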

Methods: We analyzed data from a single clinic visit of 5201 participants of a cardiovascular health study. We examined 61 variables collected from this baseline exam, including demographic and biologic data, medical history, medications, serum biomarkers, and electrocardiographic and echocardiographic data. We compared several machine learning methods (eg, random forest, L1 regression, gradient boosted decision tree, support vector machine, and k-nearest neighbor) trained to predict incident MI that occurred within time horizons ranging from 500 to 10,000 days of follow-up. Models were compared on a 20% held-out testing set using area under the receiver operating characteristic curve (AUROC). Variable importance was assessed for the random forest and L1 regression models across time points. We compared results with the FHS coronary heart disease gender-specific Cox proportional hazards regression functions.
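
The evaluation loop described in the Methods can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' code: the helper names (`horizon_labels`, `auroc`, `split_indices`), the 500-day step between horizons, the toy data, and the choice to label participants without a recorded MI as 0 regardless of how long they were followed are all assumptions made here. The abstract does not say how censoring was handled.

```python
import random


def horizon_labels(event_days, horizon):
    """Return 1 if an MI was recorded on or before `horizon` days, else 0.

    `event_days` holds the day of the first MI, or None if none was recorded.
    Assumption: participants without a recorded MI are labeled 0.
    """
    return [1 if d is not None and d <= horizon else 0 for d in event_days]


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and hold out `test_fraction` of them for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


if __name__ == "__main__":
    # Toy data invented for illustration; not taken from the study.
    event_days = [300, None, 1800, None, 4200, None, 900, None, 7000, None]
    risk_score = [0.8, 0.2, 0.6, 0.3, 0.5, 0.1, 0.7, 0.4, 0.35, 0.25]
    # Assumed 500-day step; the abstract gives only the 500 to 10,000 day range.
    for horizon in range(500, 10001, 500):
        y = horizon_labels(event_days, horizon)
        print(horizon, sum(y), round(auroc(y, risk_score), 3))
```

In this sketch the risk score is fixed across horizons to keep the example short. In the study, each model was trained separately for each horizon and scored on the 20% held-out set, which `split_indices` would support.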

Results: There were 4190 participants included in the analysis, with 2522 (60.2%) female participants and an average age of 72.6 years. Over 10,000 days of follow-up, there were 813 incident MI events. The machine learning models were most predictive over moderate follow-up time horizons (ie, 1500-2500 days). Overall, the L1 (Lasso) logistic regression demonstrated the strongest classification accuracy across all time horizons. This model was most predictive at 1500 days of follow-up, with an AUROC of 0.71. The most influential variables differed by follow-up time and model: across all time frames, gender was the most important feature for the L1 regression and weight was the most important feature for the random forest model. Compared with the Framingham Cox function, the L1 and random forest models performed better across all time frames beyond 1500 days.

Conclusions: In a population free of coronary heart disease, machine learning techniques can be used to predict incident MI at varying time horizons with reasonable accuracy; prediction accuracy was strongest over moderate follow-up periods. Validation across additional populations is needed to confirm the validity of this approach in risk prediction.

Source journal: JMIR Cardio (Computer Science: Computer Science Applications)
CiteScore: 3.50
Self-citation rate: 0.00%
Articles published: 25
Review time: 12 weeks