Machine Learning and Natural Language Processing to Improve Classification of Atrial Septal Defects in Electronic Health Records

IF 1.6; Zone 4, Medicine; JCR Q4, Developmental Biology

Birth Defects Research. Pub Date: 2025-03-04. DOI: 10.1002/bdr2.2451

Yuting Guo, Haoming Shi, Wendy M. Book, Lindsey Carrie Ivey, Fred H. Rodriguez III, Reza Sameni, Cheryl Raskind-Hood, Chad Robichaux, Karrie F. Downing, Abeed Sarker

Birth Defects Research, volume 117, issue 3. Journal article. https://onlinelibrary.wiley.com/doi/10.1002/bdr2.2451

Abstract

Background

International Classification of Diseases (ICD) codes can accurately identify patients with certain congenital heart defects (CHDs). In ICD-defined CHD data sets, the code for secundum atrial septal defect (ASD) is the most common, but it has a low positive predictive value (PPV) for CHD, which can lead to erroneous conclusions being drawn from such data sets. Public health surveillance needs methods that reduce the false positive rate for CHD among individuals captured with the ASD ICD code.
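
To make the positive predictive value concrete: among the records flagged by a code, it is the fraction that are true cases. The sketch below is a minimal illustration only. The function name and the counts are hypothetical and are not taken from the study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): share of flagged records that are true cases."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when no records are flagged")
    return true_positives / flagged


# Hypothetical counts, for illustration only (not from the study):
# 1000 records carry the ASD ICD code, 400 of them have a confirmed CHD.
ppv = positive_predictive_value(true_positives=400, false_positives=600)
print(f"PPV = {ppv:.2f}")  # PPV = 0.40
```

A low value such as this one means most code-flagged records are false positives for CHD, which is the problem the Background describes.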

Methods

We propose a two-level classification system, which includes a CHD and an ASD classification model, to categorize cases with an ASD ICD code into three groups: ASD, other CHD, or no CHD (including patent foramen ovale). In the proposed approach, a machine learning model that leverages structured data is combined with a text classification system. We compare performances for three text classification strategies: support vector machines (SVMs) using text-based features, a robustly optimized Transformer-based model (RoBERTa), and a scalable tree boosting system using non-text-based features (XGBoost).
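
One way to read the two-level system is as a cascade: a CHD model first separates CHD from no CHD, and an ASD model then splits the CHD cases into ASD and other CHD. The abstract does not state how the two outputs are combined, so this ordering is an assumption. The keyword rules below are hypothetical stand-ins for trained models such as SVM, RoBERTa, or XGBoost.

```python
from typing import Callable

# A binary classifier is any callable mapping a clinical note to True/False.
BinaryClassifier = Callable[[str], bool]


def classify_case(note: str, has_chd: BinaryClassifier, has_asd: BinaryClassifier) -> str:
    """Combine two binary decisions into one of three groups.

    Assumed cascade (not confirmed by the abstract): the CHD model runs first,
    and the ASD model is consulted only for cases judged to have a CHD.
    """
    if not has_chd(note):
        return "no CHD"  # this group includes patent foramen ovale
    return "ASD" if has_asd(note) else "other CHD"


# Hypothetical keyword rules standing in for trained models.
def toy_chd_model(note: str) -> bool:
    text = note.lower()
    return "septal defect" in text or "tetralogy" in text


def toy_asd_model(note: str) -> bool:
    return "atrial septal defect" in note.lower()


notes = [
    "Echo shows secundum atrial septal defect.",
    "History of tetralogy of Fallot repair.",
    "Patent foramen ovale noted, otherwise normal.",
]
labels = [classify_case(n, toy_chd_model, toy_asd_model) for n in notes]
print(labels)  # ['ASD', 'other CHD', 'no CHD']
```

Passing the two models as callables keeps the combination logic independent of the model family, which mirrors how the study mixes different classifiers at the two levels.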

Results

Using SVM for both CHD and ASD classification resulted in the best performance for the ASD and no CHD groups, achieving F₁ scores of 0.53 (±0.05) and 0.78 (±0.02), respectively. XGBoost for CHD and SVM for ASD classification performed best for the other CHD group (F₁ score: 0.39 [±0.03]).
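
For readers unfamiliar with the metric, the F₁ score for a group is the harmonic mean of precision and recall for that group. The sketch below computes a per-class F₁ and summarizes it across evaluation folds as mean ± standard deviation. The abstract does not say what the ± values represent, so that summary is an assumption, and the labels are hypothetical.

```python
from statistics import mean, stdev


def f1_for_class(y_true: list[str], y_pred: list[str], target: str) -> float:
    """F1 = 2PR / (P + R) for one class, treating all other classes as negative."""
    tp = sum(t == target and p == target for t, p in zip(y_true, y_pred))
    fp = sum(t != target and p == target for t, p in zip(y_true, y_pred))
    fn = sum(t == target and p != target for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no correct positives, so F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted labels for two evaluation folds.
folds = [
    (["ASD", "ASD", "no CHD", "other CHD"], ["ASD", "no CHD", "no CHD", "ASD"]),
    (["ASD", "no CHD", "no CHD", "other CHD"], ["ASD", "no CHD", "ASD", "other CHD"]),
]
asd_scores = [f1_for_class(t, p, "ASD") for t, p in folds]
print(f"ASD F1: {mean(asd_scores):.2f} (±{stdev(asd_scores):.2f})")
```

Reporting the score per group, as the Results do, shows where a model is weak: here the other CHD group is the hardest of the three.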

Conclusions

This study demonstrates that patients' clinical notes and machine learning can be used to classify cases at a finer granularity than ICD codes allow, in particular with a higher PPV for CHD. The proposed approach can improve CHD surveillance.

Source journal

Birth Defects Research Medicine-Embryology

CiteScore: 3.60

Self-citation rate: 9.50%

Articles published: 153

Journal description: The journal Birth Defects Research publishes original research and reviews in areas related to the etiology of adverse developmental and reproductive outcomes. In particular, the journal is devoted to the publication of original scientific research that contributes to the understanding of the biology of embryonic development and the prenatal causative factors and mechanisms leading to adverse pregnancy outcomes, namely structural and functional birth defects, pregnancy loss, and postnatal functional defects in the human population, and to the identification of prenatal factors and biological mechanisms that reduce these risks. Adverse reproductive and developmental outcomes may have genetic, environmental, nutritional, or epigenetic causes. Accordingly, the journal takes an integrated, multidisciplinary approach in its organization and publication strategy. It contains separate sections for clinical and molecular teratology, developmental and reproductive toxicology, and reviews in developmental biology to acknowledge and accommodate the integrative nature of research in this field. Each section has a dedicated editor who is a leader in their field and who has full editorial authority in their area.