Tree-based classification model for Long-COVID infection prediction with age stratification using data from the National COVID Cohort Collaborative.

IF 2.5 Q2 HEALTH CARE SCIENCES & SERVICES
JAMIA Open. Pub Date: 2024-10-09. eCollection Date: 2024-12-01. DOI: 10.1093/jamiaopen/ooae111
Will Ke Wang, Hayoung Jeong, Leeor Hershkovich, Peter Cho, Karnika Singh, Lauren Lederer, Ali R Roghanizad, Md Mobashir Hasan Shandhi, Warren Kibbe, Jessilyn Dunn
JAMIA Open, volume 7, issue 4, article ooae111. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11547948/pdf/

Abstract

Objectives: We propose and validate a domain knowledge-driven classification model for diagnosing post-acute sequelae of SARS-CoV-2 infection (PASC), also known as Long COVID, using electronic health record (EHR) data.

Materials and methods: We developed a robust model that incorporates features identified in our literature review as strongly indicative of PASC or associated with the severity of COVID-19 symptoms. The XGBoost tree-based architecture was chosen for its ability to handle class-imbalanced data and its potential for high interpretability. Using the training data provided by the Long COVID Computation Challenge (L3C), which was a sample of the National COVID Cohort Collaborative (N3C), our models were fine-tuned and calibrated to optimize the Area Under the Receiver Operating Characteristic curve (AUROC) and the F1 score, following best practices for the class-imbalanced N3C data.
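One common way to tune a classifier for F1 on class-imbalanced data is to choose the decision threshold on held-out predictions rather than using the default of 0.5. The abstract does not say whether the authors did this, so the following is only a minimal sketch under that assumption. The function names and the toy validation data are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when that denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(
    y_true: Sequence[int], scores: Sequence[float]
) -> Tuple[float, float]:
    """Try each distinct score as a cut-off (predict 1 when score >= cut-off)
    and return the (cut-off, F1) pair with the highest F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up validation set: 2 positives among 8 patients (imbalanced).
y_val = [0, 0, 0, 0, 0, 0, 1, 1]
p_val = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.40, 0.70]

threshold, f1 = best_f1_threshold(y_val, p_val)
f1_at_default = f1_score(y_val, [1 if s >= 0.5 else 0 for s in p_val])
print(threshold, round(f1, 3), round(f1_at_default, 3))
```

On this toy data the tuned cut-off yields a higher F1 than the default 0.5. In practice the threshold would be chosen on validation folds only and then applied unchanged to the test data.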

Results: Our age-stratified classification model demonstrated strong performance with an average 5-fold cross-validated AUROC of 0.844 and F1 score of 0.539 across the young adult, mid-aged, and older-aged populations in the training data. In an independent testing dataset, which was made available after the challenge was over, we achieved an overall AUROC score of 0.814 and F1 score of 0.545.
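The idea of reporting AUROC per age stratum and then averaging can be sketched in a few lines of plain Python. This is an illustration only: the age cut-offs (40 and 65), the function names, and the toy data are assumptions, since the abstract does not state the actual strata boundaries or the evaluation code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative
    (ties count as 0.5). Quadratic in sample size; fine for a sketch."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def age_band(age: int) -> str:
    """Hypothetical cut-offs; the abstract does not give the real ones."""
    if age < 40:
        return "young adult"
    if age < 65:
        return "mid-aged"
    return "older-aged"


def auroc_by_band(
    ages: Sequence[int], y_true: Sequence[int], scores: Sequence[float]
) -> Dict[str, float]:
    """Group patients by age band and compute AUROC within each group."""
    groups: Dict[str, Tuple[List[int], List[float]]] = defaultdict(lambda: ([], []))
    for a, t, s in zip(ages, y_true, scores):
        labels, preds = groups[age_band(a)]
        labels.append(t)
        preds.append(s)
    return {band: auroc(l, p) for band, (l, p) in groups.items()}


# Made-up patients: age, PASC label, predicted probability.
ages = [25, 31, 38, 45, 52, 60, 70, 78, 83]
labels = [0, 1, 0, 0, 1, 1, 1, 0, 0]
scores = [0.2, 0.6, 0.4, 0.3, 0.5, 0.2, 0.9, 0.4, 0.1]

per_band = auroc_by_band(ages, labels, scores)
mean_auroc = sum(per_band.values()) / len(per_band)
print(per_band, round(mean_auroc, 3))
```

Averaging per-stratum scores weights each age group equally regardless of size, which is one plausible reading of "average ... across the young adult, mid-aged, and older-aged populations"; a pooled AUROC over all patients would generally give a different number.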

Discussion: The results demonstrated the utility of knowledge-driven feature engineering on sparse EHR data and of demographic stratification in model development for diagnosing a complex and heterogeneously presenting condition like PASC. The model's architecture, mirroring natural clinician decision-making processes, contributed to its robustness and interpretability, which are crucial for clinical translatability. Further, the model's generalizability was evaluated on new cross-sectional data provided in the later stages of the L3C challenge.

Conclusion: The study proposed and validated the effectiveness of age-stratified, tree-based classification models to diagnose PASC. Our approach highlights the potential of machine learning in addressing the diagnostic challenges posed by the heterogeneity of Long-COVID symptoms.
