Evaluating large language models on hospital health data for automated emergency triage.

Impact Factor 2.3 · CAS Tier 3 (Medicine) · JCR Q3 (Engineering, Biomedical)
Carlos Lafuente, Mehdi Rahim
{"title":"Evaluating large language models on hospital health data for automated emergency triage.","authors":"Carlos Lafuente, Mehdi Rahim","doi":"10.1007/s11548-025-03475-1","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs) have a significant potential in healthcare due to their ability to process unstructured text from electronic health records (EHRs) and to generate knowledge with few or no training. In this study, we investigate the effectiveness of LLMs for clinical decision support, specifically in the context of emergency department triage, where the volume of textual data is minimal compared to other scenarios such as making a clinical diagnosis.</p><p><strong>Methods: </strong>We benchmark LLMs with traditional machine learning (ML) approaches using the Emergency Severity Index (ESI) as the gold standard criteria of triage. The benchmark includes general purpose, specialised, and fine-tuned LLMs. All models are prompted to predict ESI score from a EHRs. We use a balanced subset (n = 1000) from MIMIC-IV-ED, a large database containing records of admissions to the emergency department of Beth Israel Deaconess Medical Center.</p><p><strong>Results: </strong>Our findings show that the best-performing models have an average F1-score below 0.60. Also, while zero-shot and fine-tuned LLMs can outperform standard ML models, their performance is surpassed by ML models augmented with features derived from LLMs or knowledge graphs.</p><p><strong>Conclusion: </strong>LLMs show value for clinical decision support in scenarios with limited textual data, such as emergency department triage. The study advocates for integrating LLM knowledge representation to improve existing ML models rather than using LLMs in isolation, suggesting this as a more promising approach to enhance the accuracy of automated triage systems.</p>","PeriodicalId":51251,"journal":{"name":"International Journal of Computer Assisted Radiology and Surgery","volume":" ","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Assisted Radiology and Surgery","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1007/s11548-025-03475-1","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose: Large language models (LLMs) have significant potential in healthcare because they can process unstructured text from electronic health records (EHRs) and generate knowledge with few or no training examples. In this study, we investigate the effectiveness of LLMs for clinical decision support, specifically in emergency department triage, where the volume of textual data is minimal compared with other scenarios such as clinical diagnosis.

Methods: We benchmark LLMs against traditional machine learning (ML) approaches, using the Emergency Severity Index (ESI) as the gold-standard triage criterion. The benchmark includes general-purpose, specialised, and fine-tuned LLMs, all prompted to predict the ESI score from EHRs. We use a balanced subset (n = 1000) of MIMIC-IV-ED, a large database of admissions to the emergency department of Beth Israel Deaconess Medical Center.
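
To make the setup concrete, a minimal zero-shot prompting sketch for ESI prediction is shown below. It is illustrative only, not the paper's pipeline: the OpenAI-compatible endpoint, model name, prompt wording, and the synthetic triage record are all assumptions.

```python
# Minimal zero-shot ESI prompting sketch (illustrative; not the authors' exact pipeline).
# The model name, prompt wording, and triage record are placeholders/assumptions.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an emergency department triage assistant. "
    "Given a patient's triage record, respond with only the Emergency Severity Index (ESI) "
    "level as a single integer from 1 (most urgent) to 5 (least urgent)."
)

def predict_esi(triage_note: str, model: str = "gpt-4o-mini") -> int | None:
    """Ask the LLM for an ESI level and parse the first digit 1-5 from its reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Triage record:\n{triage_note}\n\nESI level:"},
        ],
    )
    reply = response.choices[0].message.content or ""
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None

# Synthetic (non-MIMIC) example record:
print(predict_esi("58-year-old male, chest pain radiating to left arm, HR 112, BP 150/95, SpO2 94%."))
```

Parsing a single digit from the reply keeps the comparison with ML classifiers straightforward, since both produce one of five discrete labels.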

Results: Our findings show that the best-performing models achieve an average F1-score below 0.60. Moreover, while zero-shot and fine-tuned LLMs can outperform standard ML models, they are in turn surpassed by ML models augmented with features derived from LLMs or knowledge graphs.
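
For reference, an "average F1-score" over a five-level label is commonly computed as a macro average across classes; the snippet below shows that calculation with scikit-learn on made-up labels (the averaging choice is an assumption, not stated in the abstract).

```python
# Macro-averaged F1 over the five ESI levels (synthetic labels; illustrative only).
from sklearn.metrics import f1_score

y_true = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]   # gold ESI levels (made up)
y_pred = [1, 2, 3, 3, 3, 2, 4, 5, 5, 4]   # model predictions (made up)

macro_f1 = f1_score(y_true, y_pred, average="macro", labels=[1, 2, 3, 4, 5])
print(f"macro F1 = {macro_f1:.2f}")
```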

Conclusion: LLMs show value for clinical decision support in scenarios with limited textual data, such as emergency department triage. The study advocates integrating LLM knowledge representations into existing ML models rather than using LLMs in isolation, as the more promising route to improving the accuracy of automated triage systems.
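
One practical reading of this conclusion is to concatenate LLM-derived text embeddings with structured triage variables and train a conventional classifier on the combined features. The sketch below uses a sentence-transformers encoder and gradient boosting as stand-ins; the embedding model, feature choices, and classifier are assumptions, not the study's actual components.

```python
# ML classifier augmented with LLM-derived text features (illustrative sketch only).
# Embedding model, structured features, and classifier are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import GradientBoostingClassifier

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any text-embedding model would do

notes = [
    "Chest pain radiating to left arm, diaphoretic.",
    "Sprained ankle after fall, stable vitals.",
]
structured = np.array([
    [112, 150, 94.0],   # heart rate, systolic BP, SpO2 (synthetic values)
    [80, 120, 99.0],
])
esi_labels = np.array([2, 4])  # synthetic gold ESI levels

# Concatenate LLM-derived embeddings with the structured triage variables.
text_features = encoder.encode(notes)
X = np.hstack([structured, text_features])

clf = GradientBoostingClassifier().fit(X, esi_labels)
print(clf.predict(X))
```

In practice the classifier would be trained on a much larger labelled set (such as the balanced MIMIC-IV-ED subset) and evaluated on held-out records.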

Source Journal
International Journal of Computer Assisted Radiology and Surgery
Categories: Engineering, Biomedical; Radiology, Nuclear Medicine & Medical Imaging
CiteScore: 5.90
Self-citation rate: 6.70%
Articles per year: 243
Review time: 6-12 weeks
Aims and scope: The International Journal for Computer Assisted Radiology and Surgery (IJCARS) is a peer-reviewed journal that provides a platform for closing the gap between medical and technical disciplines, and encourages interdisciplinary research and development activities in an international environment.