Transformers and large language models are efficient feature extractors for electronic health record studies
Kevin Yuan, Chang Ho Yoon, Qingze Gu, Henry Munby, A Sarah Walker, Tingting Zhu, David W Eyre
Communications Medicine 5(1):83, 21 March 2025. DOI: 10.1038/s43856-025-00790-1
Abstract
Background: Free-text data is abundant in electronic health records, but challenges in accurate and scalable information extraction mean that less-specific clinical codes are often used instead.
Methods: We evaluated the efficacy of feature extraction using modern natural language processing (NLP) methods and large language models (LLMs) on 938,150 hospital antibiotic prescriptions from Oxfordshire, UK. Specifically, we investigated inferring the type(s) of infection from a free-text "indication" field, where clinicians state the reason for prescribing antibiotics. Clinical researchers labelled a subset of the 4000 most frequent unique indications (representing 692,310 prescriptions) into 11 categories describing the infection source or clinical syndrome. Various models were then trained to determine the binary presence/absence of these infection types, as well as any uncertainty expressed by clinicians.
Results: We show that, on separate internal (n = 2000 prescriptions) and external (n = 2000 prescriptions) test datasets, a fine-tuned domain-specific Bio+Clinical BERT model performs best across the 11 categories (average F1 score 0.97 and 0.98, respectively), outperforming traditional regular-expression (F1 = 0.71 and 0.74) and n-grams/XGBoost (F1 = 0.86 and 0.84) models. A zero-shot OpenAI GPT-4 model matches the performance of the traditional NLP models without the need for labelled training data (F1 = 0.71 and 0.86), and a fine-tuned GPT-3.5 model achieves performance similar to the fine-tuned BERT-based model (F1 = 0.95 and 0.97). Free-text indications reveal specific infection sources 31% more often than ICD-10 codes.
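To make the regular-expression baseline concrete, the sketch below shows how such a model might map indication text to binary category labels. The patterns and category names here are hypothetical illustrations; the study's actual 11 categories and rules are not given in the abstract.

```python
import re

# Hypothetical patterns for a few illustrative categories; the study's
# actual 11 infection categories and expressions are not reproduced here.
PATTERNS = {
    "urinary": re.compile(r"\b(uti|urinary|urosepsis|pyelonephritis)\b", re.I),
    "respiratory": re.compile(r"\b(pneumonia|chest infection|lrti|cap|hap)\b", re.I),
    "uncertain": re.compile(r"\?|\bquery\b|\bpossible\b|\blikely\b", re.I),
}

def classify(indication: str) -> dict:
    """Return binary presence/absence of each category for one indication."""
    return {cat: bool(pat.search(indication)) for cat, pat in PATTERNS.items()}

print(classify("?UTI vs chest infection"))
# flags both a urinary and a respiratory source, plus clinician uncertainty
```

A rule-based baseline like this is cheap and transparent, which is why such models are a common comparator, but the abstract's results suggest it generalises less well to varied free-text phrasing than the fine-tuned transformer models.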
Conclusions: Modern transformer-based models have the potential to be used widely throughout medicine to extract information from structured free-text records, to facilitate better research and patient care.