Evaluating large language models for drafting emergency department encounter summaries

Christopher Y K Williams, Jaskaran Bains, Tianyu Tang, Kishan Patel, Alexa N Lucas, Fiona Chen, Brenda Y Miao, Atul J Butte, Aaron E Kornblith

PLOS Digital Health, 4(6): e0000899. Published 2025-06-17 (eCollection 2025/6). DOI: 10.1371/journal.pdig.0000899. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12173386/pdf/
Citations: 0
Abstract
Large language models (LLMs) possess a range of capabilities that may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed. In this cross-sectional study of 100 randomly sampled adult Emergency Department (ED) visits from 2012 to 2023 at the University of California, San Francisco ED, we sought to investigate the performance of GPT-4 and GPT-3.5-turbo in generating ED encounter summaries and to evaluate the prevalence and type of errors in each section of the encounter summary across three evaluation criteria: 1) Inaccuracy of LLM-summarized information; 2) Hallucination of information; 3) Omission of relevant clinical information. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases; however, 42% of the summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies and hallucinations were most commonly found in the Plan sections of LLM-generated summaries, while clinical omissions were concentrated in text describing patients' Physical Examination findings or History of Presenting Complaint. The potential harmfulness score across errors was low, with a mean score of 0.57 (SD 1.11) out of 7 and only three errors scoring 4 ('Potential for permanent harm') or greater. In summary, we found that LLMs could generate accurate encounter summaries but were liable to hallucination and omission of clinically relevant information. Individual errors on average had a low potential for harm. A comprehensive understanding of the location and type of errors found in LLM-generated clinical text is important to facilitate clinician review of such content and prevent patient harm.