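Natural Language Processing framework for identifying abdominal aortic aneurysm repairs using unstructured electronic health records.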

IF 3.9 · Zone 2, Multidisciplinary Journals · Q1 MULTIDISCIPLINARY SCIENCES

Scientific Reports Pub Date : 2025-07-21 DOI:10.1038/s41598-025-11870-6

Daniel C Thompson, Reza Mofidi

Scientific Reports, volume 15(1), 26388. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12280078/pdf/ Record: https://doi.org/10.1038/s41598-025-11870-6

Citations: 0

Abstract



Patient identification for national registries often relies upon clinician recognition of cases or retrospective searches using potentially inaccurate clinical codes, leading to incomplete data capture and inefficiencies. Natural Language Processing (NLP) offers a promising solution by automating analysis of electronic health records (EHRs). This study aimed to develop NLP models for identifying and classifying abdominal aortic aneurysm (AAA) repairs from unstructured EHRs, demonstrating a proof-of-concept for automated patient identification in registries like the National Vascular Registry. Using the MIMIC-IV-Note dataset, a multi-tiered approach was developed to identify vascular patients (Task 1), AAA repairs (Task 2), and classify repairs as primary or revision (Task 3). Four NLP models were trained and evaluated using 4870 annotated records: scispaCy, BERT-base, Bio-clinicalBERT, and a scispaCy/Bio-clinicalBERT ensemble. Models were compared using accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The scispaCy model demonstrated the fastest training (2 min/epoch) and inference times (2.87 samples/sec). For Task 1, scispaCy and ensemble models achieved the highest accuracy (0.97). In Task 2, all models performed exceptionally well, with ensemble, scispaCy, and Bio-clinicalBERT models achieving 0.99 accuracy and 1.00 AUC. For Task 3, Bio-clinicalBERT and the ensemble model achieved an AUC of 1.00, with Bio-clinicalBERT displaying the best overall accuracy (0.98). This study demonstrates that NLP models can accurately identify and classify AAA repair cases from unstructured EHRs, suggesting significant potential for automating patient identification in vascular surgery and other medical registries, reducing administra.
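The comparison metrics named in the abstract (accuracy, precision, recall, F1-score, and AUC) can be illustrated with a minimal pure-Python sketch. This is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made for illustration only, and AUC is computed here with the simple pairwise-ranking definition rather than any particular library routine.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: 1 = note describes an AAA repair, 0 = it does not.
    labels = [1, 0, 1, 1, 0, 0]
    scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
    # Assumed decision threshold of 0.5 (not taken from the paper).
    preds = [1 if s >= 0.5 else 0 for s in scores]
    print(classification_metrics(labels, preds))
    print("AUC:", roc_auc(labels, scores))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC is computed from the raw scores, which is why a model can reach an AUC of 1.00 while its accuracy stays slightly below 1.00.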


Source journal

Scientific Reports (Natural Science Disciplines)

CiteScore: 7.50

Self-citation rate: 4.30%

Articles published: 19567

Review time: 3.9 months

About the journal: We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections. Scientific Reports has a 2-year impact factor: 4.380 (2021), and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021).

• Engineering: Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world's biggest challenges, helping to save lives and improve the way we live.

• Physical sciences: Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature, often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics.

• Earth and environmental sciences: Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. It also considers the interactions between humans and these systems.

• Biological sciences: Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms, from microorganisms to animals and plants.

• Health sciences: The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.