ChatGPT、Gemini和DeepSeek在急诊科使用真实对话进行非关键分诊支持的性能。

IF 2.3 3区医学 Q1 EMERGENCY MEDICINE

BMC Emergency Medicine Pub Date : 2025-09-01 DOI:10.1186/s12873-025-01337-2

Sukyo Lee, Sumin Jung, Jong-Hak Park, Hanjin Cho, Sungwoo Moon, Sejoong Ahn

{"title":"ChatGPT、Gemini和DeepSeek在急诊科使用真实对话进行非关键分诊支持的性能。","authors":"Sukyo Lee, Sumin Jung, Jong-Hak Park, Hanjin Cho, Sungwoo Moon, Sejoong Ahn","doi":"10.1186/s12873-025-01337-2","DOIUrl":null,"url":null,"abstract":"Background: Timely and accurate triage is crucial for the emergency department (ED) care. Recently, there has been growing interest in applying large language models (LLMs) to support triage decision-making. However, most existing studies have evaluated these models using simulated scenarios rather than real-world clinical cases. Therefore, we evaluated the performance of multiple commercial LLMs for non-critical triage support in ED using real-world clinical conversations.Methods: We retrospectively analyzed real-world triage conversations prospectively collected from three tertiary hospitals in South Korea. Multiple commercial LLMs-including OpenAI GPT-4o, GPT-4.1, O3, Google Gemini 2.0 flash, Gemini 2.5 flash, Gemini 2.5 pro, DeepSeek V3, and DeepSeek R1-were evaluated for the accuracy in triaging patient urgency based solely on unsummarized dialogue. The Korean Triage and Acuity Scale (KTAS) assigned by triage nurses was used as the gold standard for evaluating the LLM classifications. Model performance was assessed under both a zero-shot prompting condition and a few-shot prompting condition that included representative examples.Results: A total of 1,057 triage cases were included in the analysis. Among the models, Gemini 2.5 flash achieved the highest accuracy (73.8%), specificity (88.9%), and PPV (94.0%). Gemini 2.5 pro demonstrated the highest sensitivity (90.9%) and F1-score (82.4%), though with lower specificity (23.3%). GPT-4.1 also showed balanced high accuracy (70.6%) and sensitivity (81.3%) with practical response times (1.79s). Performance varied widely between models and even between different versions from the same vendor. With few-shot prompting, most models showed further improvements in accuracy and F1-score.Conclusions: LLMs can accurately triage ED patient urgency using real-world clinical conversations. Several models demonstrated both high sensitivity and acceptable response times, supporting the feasibility of LLM in non-critical triage support tools in diverse clinical environments. These findings apply to non-critical patients (KTAS 3-5), and further research should address integration with objective clinical data and real-time workflow.","PeriodicalId":9002,"journal":{"name":"BMC Emergency Medicine","volume":"25 1","pages":"176"},"PeriodicalIF":2.3000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12403343/pdf/","citationCount":"0","resultStr":"{\"title\":\"Performance of ChatGPT, Gemini and DeepSeek for non-critical triage support using real-world conversations in emergency department.\",\"authors\":\"Sukyo Lee, Sumin Jung, Jong-Hak Park, Hanjin Cho, Sungwoo Moon, Sejoong Ahn\",\"doi\":\"10.1186/s12873-025-01337-2\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background: Timely and accurate triage is crucial for the emergency department (ED) care. Recently, there has been growing interest in applying large language models (LLMs) to support triage decision-making. However, most existing studies have evaluated these models using simulated scenarios rather than real-world clinical cases. Therefore, we evaluated the performance of multiple commercial LLMs for non-critical triage support in ED using real-world clinical conversations.Methods: We retrospectively analyzed real-world triage conversations prospectively collected from three tertiary hospitals in South Korea. Multiple commercial LLMs-including OpenAI GPT-4o, GPT-4.1, O3, Google Gemini 2.0 flash, Gemini 2.5 flash, Gemini 2.5 pro, DeepSeek V3, and DeepSeek R1-were evaluated for the accuracy in triaging patient urgency based solely on unsummarized dialogue. The Korean Triage and Acuity Scale (KTAS) assigned by triage nurses was used as the gold standard for evaluating the LLM classifications. Model performance was assessed under both a zero-shot prompting condition and a few-shot prompting condition that included representative examples.Results: A total of 1,057 triage cases were included in the analysis. Among the models, Gemini 2.5 flash achieved the highest accuracy (73.8%), specificity (88.9%), and PPV (94.0%). Gemini 2.5 pro demonstrated the highest sensitivity (90.9%) and F1-score (82.4%), though with lower specificity (23.3%). GPT-4.1 also showed balanced high accuracy (70.6%) and sensitivity (81.3%) with practical response times (1.79s). Performance varied widely between models and even between different versions from the same vendor. With few-shot prompting, most models showed further improvements in accuracy and F1-score.Conclusions: LLMs can accurately triage ED patient urgency using real-world clinical conversations. Several models demonstrated both high sensitivity and acceptable response times, supporting the feasibility of LLM in non-critical triage support tools in diverse clinical environments. These findings apply to non-critical patients (KTAS 3-5), and further research should address integration with objective clinical data and real-time workflow.\",\"PeriodicalId\":9002,\"journal\":{\"name\":\"BMC Emergency Medicine\",\"volume\":\"25 1\",\"pages\":\"176\"},\"PeriodicalIF\":2.3000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12403343/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Emergency Medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12873-025-01337-2\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EMERGENCY MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Emergency Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12873-025-01337-2","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EMERGENCY MEDICINE","Score":null,"Total":0}

引用次数: 0

摘要

背景：及时准确的分诊对急诊科（ED）护理至关重要。最近，人们对应用大型语言模型（llm）来支持分类决策越来越感兴趣。然而，大多数现有研究使用模拟情景而不是真实的临床病例来评估这些模型。因此，我们利用现实世界的临床对话评估了多个商业llm在急诊科非关键分诊支持方面的表现。方法：我们回顾性分析了从韩国三家三级医院前瞻性收集的现实世界的分诊谈话。多个商业llms（包括OpenAI gpt - 40、GPT-4.1、O3、谷歌Gemini 2.0 flash、Gemini 2.5 flash、Gemini 2.5 pro、DeepSeek V3和DeepSeek r1）仅基于未总结的对话来评估患者紧急程度的准确性。由分诊护士分配的韩国分诊和敏锐度量表（KTAS）被用作评估LLM分类的金标准。在零次提示和包含代表性示例的少次提示条件下评估模型的性能。结果：共纳入1057例分诊病例分析。其中Gemini 2.5 flash的准确率最高（73.8%），特异性最高（88.9%），PPV最高（94.0%）。Gemini 2.5 pro具有最高的敏感性（90.9%）和f1评分（82.4%），但特异性较低（23.3%）。GPT-4.1还显示出高精度（70.6%）和灵敏度（81.3%），实际响应时间（1.79s）。在不同的模型之间，甚至在来自同一供应商的不同版本之间，性能差异很大。在较少的提示下，大多数模型的准确率和f1分数都有了进一步的提高。结论：llm可以通过真实的临床对话准确地判断急诊科患者的紧急程度。有几个模型显示出高灵敏度和可接受的响应时间，支持LLM在不同临床环境中作为非关键分诊支持工具的可行性。这些发现适用于非危重患者（KTAS 3-5），进一步的研究应解决与客观临床数据和实时工作流程的整合。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Performance of ChatGPT, Gemini and DeepSeek for non-critical triage support using real-world conversations in emergency department.

查看原文本刊更多论文

Performance of ChatGPT, Gemini and DeepSeek for non-critical triage support using real-world conversations in emergency department.

Background: Timely and accurate triage is crucial for the emergency department (ED) care. Recently, there has been growing interest in applying large language models (LLMs) to support triage decision-making. However, most existing studies have evaluated these models using simulated scenarios rather than real-world clinical cases. Therefore, we evaluated the performance of multiple commercial LLMs for non-critical triage support in ED using real-world clinical conversations.

Methods: We retrospectively analyzed real-world triage conversations prospectively collected from three tertiary hospitals in South Korea. Multiple commercial LLMs-including OpenAI GPT-4o, GPT-4.1, O3, Google Gemini 2.0 flash, Gemini 2.5 flash, Gemini 2.5 pro, DeepSeek V3, and DeepSeek R1-were evaluated for the accuracy in triaging patient urgency based solely on unsummarized dialogue. The Korean Triage and Acuity Scale (KTAS) assigned by triage nurses was used as the gold standard for evaluating the LLM classifications. Model performance was assessed under both a zero-shot prompting condition and a few-shot prompting condition that included representative examples.

Results: A total of 1,057 triage cases were included in the analysis. Among the models, Gemini 2.5 flash achieved the highest accuracy (73.8%), specificity (88.9%), and PPV (94.0%). Gemini 2.5 pro demonstrated the highest sensitivity (90.9%) and F1-score (82.4%), though with lower specificity (23.3%). GPT-4.1 also showed balanced high accuracy (70.6%) and sensitivity (81.3%) with practical response times (1.79s). Performance varied widely between models and even between different versions from the same vendor. With few-shot prompting, most models showed further improvements in accuracy and F1-score.

Conclusions: LLMs can accurately triage ED patient urgency using real-world clinical conversations. Several models demonstrated both high sensitivity and acceptable response times, supporting the feasibility of LLM in non-critical triage support tools in diverse clinical environments. These findings apply to non-critical patients (KTAS 3-5), and further research should address integration with objective clinical data and real-time workflow.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

BMC Emergency Medicine Medicine-Emergency Medicine

CiteScore

3.50

自引率

8.00%

发文量

178

审稿时长

29 weeks

期刊介绍： BMC Emergency Medicine is an open access, peer-reviewed journal that considers articles on all urgent and emergency aspects of medicine, in both practice and basic research. In addition, the journal covers aspects of disaster medicine and medicine in special locations, such as conflict areas and military medicine, together with articles concerning healthcare services in the emergency departments.