Machine Learning and Natural Language Processing to Improve Classification of Atrial Septal Defects in Electronic Health Records

IF 1.6 · Zone 4 (Medicine) · Q4 (Developmental Biology)
Yuting Guo, Haoming Shi, Wendy M. Book, Lindsey Carrie Ivey, Fred H. Rodriguez III, Reza Sameni, Cheryl Raskind-Hood, Chad Robichaux, Karrie F. Downing, Abeed Sarker
Birth Defects Research, volume 117, issue 3. Journal article, published 2025-03-04. DOI: 10.1002/bdr2.2451
https://onlinelibrary.wiley.com/doi/10.1002/bdr2.2451

Abstract

Background

International Classification of Diseases (ICD) codes can accurately identify patients with certain congenital heart defects (CHDs). In ICD-defined CHD data sets, the code for secundum atrial septal defect (ASD) is the most common, but it has a low positive predictive value for CHD, which can lead to erroneous conclusions being drawn from such data sets. Public health surveillance needs methods that reduce false positives for CHD among individuals captured with the ASD ICD code.

Methods

We propose a two-level classification system, comprising a CHD classification model and an ASD classification model, that categorizes cases with an ASD ICD code into three groups: ASD, other CHD, or no CHD (including patent foramen ovale). The approach combines a machine learning model that leverages structured data with a text classification system. We compare the performance of three classification strategies: support vector machines (SVMs) using text-based features, a robustly optimized Transformer-based model (RoBERTa), and a scalable tree boosting system (XGBoost) using non-text-based features.
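
One plausible reading of the two-level design is a cascade: a first model decides CHD versus no CHD, and a second model splits the CHD cases into ASD versus other CHD. The abstract does not state how the two models' outputs are combined, so the cascade order, the function name `two_level_label`, the record layout, and the keyword-based toy models below are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]  # hypothetical record layout: {"note": "<clinical note text>"}


def two_level_label(
    record: Record,
    is_chd: Callable[[Record], bool],
    is_asd: Callable[[Record], bool],
) -> str:
    """Combine two binary classifiers into one of three groups.

    Level 1 (is_chd): CHD vs. no CHD.
    Level 2 (is_asd): among CHD cases, ASD vs. other CHD.
    """
    if not is_chd(record):
        return "no CHD"
    return "ASD" if is_asd(record) else "other CHD"


# Toy stand-ins for trained models (assumptions for illustration only).
# In practice these would wrap a fitted SVM, RoBERTa, or XGBoost classifier.
def toy_chd_model(record: Record) -> bool:
    return "defect" in record["note"].lower()


def toy_asd_model(record: Record) -> bool:
    return "secundum" in record["note"].lower()


records: List[Record] = [
    {"note": "Secundum atrial septal defect confirmed on echo."},
    {"note": "Small ventricular septal defect."},
    {"note": "Patent foramen ovale only."},
]

labels = [two_level_label(r, toy_chd_model, toy_asd_model) for r in records]
print(labels)
```

Passing the two models in as callables keeps the combination logic independent of the model family, which matches the abstract's comparison of different model pairings (for example, XGBoost for the CHD level and SVM for the ASD level).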

Results

Using SVM for both CHD and ASD classification gave the best performance for the ASD and no CHD groups, with F1 scores of 0.53 (±0.05) and 0.78 (±0.02), respectively. Using XGBoost for CHD classification and SVM for ASD classification performed best for the other CHD group (F1 score: 0.39 [±0.03]).
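
For readers less familiar with the metric, the sketch below shows how a per-group F1 score and positive predictive value (PPV) can be computed from true and predicted labels. The label lists are made-up toy data, not the study's data, and the helper name `per_class_scores` is hypothetical.

```python
from typing import Dict, List


def per_class_scores(y_true: List[str], y_pred: List[str], label: str) -> Dict[str, float]:
    """Compute PPV (precision), recall, and F1 for one group, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives

    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of PPV and recall.
    f1 = 2 * ppv * recall / (ppv + recall) if (ppv + recall) else 0.0
    return {"ppv": ppv, "recall": recall, "f1": f1}


# Toy labels for illustration only.
y_true = ["ASD", "ASD", "other CHD", "no CHD", "no CHD", "no CHD"]
y_pred = ["ASD", "no CHD", "other CHD", "no CHD", "no CHD", "ASD"]

scores = {g: per_class_scores(y_true, y_pred, g) for g in ("ASD", "other CHD", "no CHD")}
for group, s in scores.items():
    print(group, round(s["ppv"], 2), round(s["recall"], 2), round(s["f1"], 2))
```

Scoring each group separately, as the abstract does, makes it visible when one group (here, other CHD in the reported results) is harder to classify than the others.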

Conclusions

This study demonstrates that patients' clinical notes and machine learning can be used to classify cases at a finer granularity than ICD codes allow, in particular with a higher positive predictive value for CHD. The proposed approach can improve CHD surveillance.

Source journal
Birth Defects Research (Medicine: Embryology)
CiteScore: 3.60
Self-citation rate: 9.50%
Articles published: 153
Journal description: The journal Birth Defects Research publishes original research and reviews in areas related to the etiology of adverse developmental and reproductive outcomes. In particular, the journal is devoted to the publication of original scientific research that contributes to the understanding of the biology of embryonic development and of the prenatal causative factors and mechanisms leading to adverse pregnancy outcomes, namely structural and functional birth defects, pregnancy loss, and postnatal functional defects in the human population, and to the identification of prenatal factors and biological mechanisms that reduce these risks. Adverse reproductive and developmental outcomes may have genetic, environmental, nutritional, or epigenetic causes. Accordingly, the journal takes an integrated, multidisciplinary approach in its organization and publication strategy. It contains separate sections for clinical and molecular teratology, developmental and reproductive toxicology, and reviews in developmental biology, to acknowledge and accommodate the integrative nature of research in this field. Each section has a dedicated editor who is a leader in his or her field and who has full editorial authority in that area.