Machine Learning and Natural Language Processing to Improve Classification of Atrial Septal Defects in Electronic Health Records

IF 1.6 | Zone 4 (Medicine) | Q4 Developmental Biology
Yuting Guo, Haoming Shi, Wendy M. Book, Lindsey Carrie Ivey, Fred H. Rodriguez III, Reza Sameni, Cheryl Raskind-Hood, Chad Robichaux, Karrie F. Downing, Abeed Sarker
Birth Defects Research, 117(3). Journal article, published 2025-03-04.
DOI: 10.1002/bdr2.2451
https://onlinelibrary.wiley.com/doi/10.1002/bdr2.2451
Abstract

Machine Learning and Natural Language Processing to Improve Classification of Atrial Septal Defects in Electronic Health Records

Background

International Classification of Diseases (ICD) codes can accurately identify patients with certain congenital heart defects (CHDs). In ICD-defined CHD data sets, the code for secundum atrial septal defect (ASD) is the most common, but it has a low positive predictive value for CHD, which can lead to erroneous conclusions being drawn from such data sets. Public health surveillance needs methods that reduce the false positive rate for CHD among individuals captured with the ASD ICD code.

Methods

We propose a two-level classification system, which includes a CHD and an ASD classification model, to categorize cases with an ASD ICD code into three groups: ASD, other CHD, or no CHD (including patent foramen ovale). In the proposed approach, a machine learning model that leverages structured data is combined with a text classification system. We compare the performance of three text classification strategies: support vector machines (SVMs) using text-based features, a robustly optimized Transformer-based model (RoBERTa), and a scalable tree boosting system using non-text-based features (XGBoost).
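
The two-level routing described above can be illustrated with a minimal Python sketch. It assumes that the CHD model runs first and that the ASD model is consulted only for cases flagged as CHD; the abstract does not spell out this order, so treat it as one plausible reading. The function names, label strings, and keyword rules below are hypothetical stand-ins for trained models, not the authors' implementation.

```python
from typing import Callable

# Hypothetical label strings for the three groups named in the abstract.
ASD, OTHER_CHD, NO_CHD = "ASD", "other CHD", "no CHD"

# A classifier here is any callable that maps a clinical note to True/False.
Classifier = Callable[[str], bool]


def two_level_classify(note: str, is_chd: Classifier, is_asd: Classifier) -> str:
    """Route one case through the CHD model, then (if needed) the ASD model."""
    if not is_chd(note):
        return NO_CHD
    return ASD if is_asd(note) else OTHER_CHD


# Toy keyword rules standing in for trained models (assumption for illustration).
def toy_chd_model(note: str) -> bool:
    text = note.lower()
    return "septal defect" in text or "tetralogy" in text


def toy_asd_model(note: str) -> bool:
    return "secundum" in note.lower()


# Invented example notes; not data from the study.
notes = [
    "Echo shows secundum atrial septal defect.",
    "History of tetralogy of Fallot, repaired.",
    "Patent foramen ovale only, no structural defect.",
]
labels = [two_level_classify(n, toy_chd_model, toy_asd_model) for n in notes]
print(labels)
```

Because each level is just a callable, an SVM, RoBERTa, or XGBoost model could in principle be swapped in at either level, which is how mixed pairings such as "XGBoost for CHD and SVM for ASD" can be expressed.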

Results

Using SVM for both CHD and ASD classification resulted in the best performance for the ASD and no CHD groups, achieving F1 scores of 0.53 (±0.05) and 0.78 (±0.02), respectively. XGBoost for CHD and SVM for ASD classification performed best for the other CHD group (F1 score: 0.39 [±0.03]).
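
For readers unfamiliar with the metric, the sketch below shows how a per-group F1 score is conventionally computed: each group is treated in turn as the positive class, and F1 is the harmonic mean of precision (the positive predictive value) and recall. The labels are invented for illustration and are not data from the study; how the authors derived the ± values is not stated in the abstract and is not modeled here.

```python
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Return the F1 score of each class, treating that class as positive."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0  # positive predictive value
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Made-up labels for six cases (assumption for illustration only).
y_true = ["ASD", "ASD", "other CHD", "no CHD", "no CHD", "no CHD"]
y_pred = ["ASD", "no CHD", "other CHD", "no CHD", "no CHD", "ASD"]
f1 = per_class_f1(y_true, y_pred)
print(f1)
```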

Conclusions

This study demonstrates that patients' clinical notes and machine learning can be used to classify cases at a finer granularity than ICD codes allow, in particular with a higher positive predictive value (PPV) for CHD. The proposed approach can improve CHD surveillance.

Source journal
Birth Defects Research (Medicine-Embryology)
CiteScore: 3.60
Self-citation rate: 9.50%
Articles published: 153
Journal description: The journal Birth Defects Research publishes original research and reviews in areas related to the etiology of adverse developmental and reproductive outcomes. In particular, the journal is devoted to the publication of original scientific research that contributes to the understanding of the biology of embryonic development and the prenatal causative factors and mechanisms leading to adverse pregnancy outcomes, namely structural and functional birth defects, pregnancy loss, and postnatal functional defects in the human population, and to the identification of prenatal factors and biological mechanisms that reduce these risks. Adverse reproductive and developmental outcomes may have genetic, environmental, nutritional or epigenetic causes. Accordingly, the journal Birth Defects Research takes an integrated, multidisciplinary approach in its organization and publication strategy. The journal Birth Defects Research contains separate sections for clinical and molecular teratology, developmental and reproductive toxicology, and reviews in developmental biology to acknowledge and accommodate the integrative nature of research in this field. Each section has a dedicated editor who is a leader in his/her field and who has full editorial authority in his/her area.