Yuting Guo, Haoming Shi, Wendy M. Book, Lindsey Carrie Ivey, Fred H. Rodriguez III, Reza Sameni, Cheryl Raskind-Hood, Chad Robichaux, Karrie F. Downing, Abeed Sarker
Birth Defects Research, vol. 117, no. 3. Published 2025-03-04. DOI: 10.1002/bdr2.2451. https://onlinelibrary.wiley.com/doi/10.1002/bdr2.2451
Machine Learning and Natural Language Processing to Improve Classification of Atrial Septal Defects in Electronic Health Records
Background
International Classification of Diseases (ICD) codes can accurately identify patients with certain congenital heart defects (CHDs). In ICD-defined CHD data sets, the code for secundum atrial septal defect (ASD) is the most common, but it has a low positive predictive value (PPV) for CHD, which can lead to erroneous conclusions being drawn from such data sets. Public health surveillance needs methods that reduce the false positive rate for CHD among individuals captured with the ASD ICD code.
Methods
We propose a two-level classification system, which includes a CHD and an ASD classification model, to categorize cases with an ASD ICD code into three groups: ASD, other CHD, or no CHD (including patent foramen ovale). The proposed approach combines a machine learning model that leverages structured data with a text classification system. We compare the performance of three text classification strategies: support vector machines (SVMs) using text-based features, a robustly optimized Transformer-based model (RoBERTa), and a scalable tree boosting system using non-text-based features (XGBoost).
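The two-level idea can be sketched as a simple cascade. This is a minimal illustration only: the abstract does not state how the two models are chained, so the order used here (CHD model first, ASD model second) is an assumption, and `classify_case`, `toy_chd_model`, and `toy_asd_model` are hypothetical names. The keyword rules merely stand in for trained classifiers such as an SVM, RoBERTa, or XGBoost, and the structured-data component is not modeled.

```python
from typing import Callable, Dict

Record = Dict[str, str]  # hypothetical container for one patient's data


def classify_case(
    record: Record,
    chd_model: Callable[[Record], bool],
    asd_model: Callable[[Record], bool],
) -> str:
    """Assign one of three groups using two binary models in sequence.

    Assumed cascade (not confirmed by the source):
    level 1 decides CHD vs. no CHD; level 2 decides ASD vs. other CHD.
    """
    if not chd_model(record):
        return "no CHD"
    return "ASD" if asd_model(record) else "other CHD"


# Toy stand-ins for trained models: plain keyword rules, for illustration only.
def toy_chd_model(record: Record) -> bool:
    return "septal defect" in record["note"].lower()


def toy_asd_model(record: Record) -> bool:
    return "secundum" in record["note"].lower()


cases = [
    {"note": "Secundum atrial septal defect confirmed on echo."},
    {"note": "Ventricular septal defect, status post repair."},
    {"note": "Patent foramen ovale only."},
]
labels = [classify_case(c, toy_chd_model, toy_asd_model) for c in cases]
print(labels)
```

Because each level is just a callable that returns a boolean, a different model could be plugged in at either level without changing the cascade, which mirrors the mixed configurations (for example, one model type for CHD and another for ASD) that the study compares.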
Results
Using SVM for both CHD and ASD classification gave the best performance for the ASD and no CHD groups, with F1 scores of 0.53 (±0.05) and 0.78 (±0.02), respectively. XGBoost for CHD and SVM for ASD classification performed best for the other CHD group (F1 score: 0.39 [±0.03]).
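For readers unfamiliar with the metric, the per-group F1 score is the harmonic mean of precision (the same quantity as PPV) and recall, computed separately for each of the three groups. The sketch below shows that calculation on made-up labels; `per_class_scores` is a hypothetical helper, and the toy data are not the study's data and do not reproduce its results.

```python
from typing import Dict, List


def per_class_scores(
    y_true: List[str], y_pred: List[str], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Compute PPV (precision), recall, and F1 for each label, one-vs-rest."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        ppv = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * ppv * recall / (ppv + recall) if (ppv + recall) else 0.0
        scores[label] = {"ppv": ppv, "recall": recall, "f1": f1}
    return scores


# Made-up labels for illustration only.
groups = ["ASD", "other CHD", "no CHD"]
y_true = ["ASD", "ASD", "other CHD", "no CHD", "no CHD", "no CHD"]
y_pred = ["ASD", "no CHD", "other CHD", "no CHD", "no CHD", "ASD"]

scores = per_class_scores(y_true, y_pred, groups)
for group in groups:
    print(group, scores[group])
```

Reporting the score per group rather than as a single average makes it visible when one group, such as other CHD, is much harder to classify than the rest.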
Conclusions
This study demonstrates that it is feasible to use patients' clinical notes and machine learning to perform finer-grained classification than ICD codes allow, in particular with a higher PPV for CHD. The proposed approach can improve CHD surveillance.
About the journal:
The journal Birth Defects Research publishes original research and reviews in areas related to the etiology of adverse developmental and reproductive outcomes. In particular, the journal is devoted to publishing original scientific research that contributes to the understanding of the biology of embryonic development and of the prenatal causative factors and mechanisms leading to adverse pregnancy outcomes, namely structural and functional birth defects, pregnancy loss, and postnatal functional defects in the human population, and to the identification of prenatal factors and biological mechanisms that reduce these risks.
Adverse reproductive and developmental outcomes may have genetic, environmental, nutritional, or epigenetic causes. Accordingly, Birth Defects Research takes an integrated, multidisciplinary approach in its organization and publication strategy. The journal contains separate sections for clinical and molecular teratology, developmental and reproductive toxicology, and reviews in developmental biology to acknowledge and accommodate the integrative nature of research in this field. Each section has a dedicated editor who is a leader in their field and who has full editorial authority in their area.