A large language model improves clinicians’ diagnostic performance in complex critical illness cases

Impact Factor 8.8 · CAS Tier 1 (Medicine) · JCR Q1, Critical Care Medicine
Xintong Wu, Yu Huang, Qing He
Journal: Critical Care · Published: 2025-06-06 · Journal Article
DOI: 10.1186/s13054-025-05468-7 (https://doi.org/10.1186/s13054-025-05468-7)
Citations: 0

Abstract

Large language models (LLMs) have demonstrated potential in assisting clinical decision-making. However, studies evaluating LLMs’ diagnostic performance on complex critical illness cases are lacking. We aimed to assess the diagnostic accuracy and response quality of an artificial intelligence (AI) model, and evaluate its potential benefits in assisting critical care residents with differential diagnosis of complex cases. This prospective comparative study collected challenging critical illness cases from the literature. Critical care residents from tertiary teaching hospitals were recruited and randomly assigned to non-AI-assisted physician and AI-assisted physician groups. We selected a reasoning model, DeepSeek-R1, for our study. We evaluated the model’s response quality using Likert scales, and we compared the diagnostic accuracy and efficiency between groups. A total of 48 cases were included. Thirty-two critical care residents were recruited, with 16 residents assigned to each group. Each resident handled an average of 3 cases. DeepSeek-R1’s responses received median Likert grades of 4.0 (IQR 4.0–5.0; 95% CI 4.0–4.5) for completeness, 5.0 (IQR 4.0–5.0; 95% CI 4.5–5.0) for clarity, and 5.0 (IQR 4.0–5.0; 95% CI 4.0–5.0) for usefulness. The AI model’s top diagnosis accuracy was 60% (29/48; 95% CI 0.456–0.729), with a median differential diagnosis quality score of 5.0 (IQR 4.0–5.0; 95% CI 4.5–5.0). Top diagnosis accuracy was 27% (13/48; 95% CI 0.146–0.396) in the non-AI-assisted physician group versus 58% (28/48; 95% CI 0.438–0.729) in the AI-assisted physician group. Median differential quality scores were 3.0 (IQR 0–5.0; 95% CI 2.0–4.0) without and 5.0 (IQR 3.0–5.0; 95% CI 3.0–5.0) with AI assistance. The AI model showed higher diagnostic accuracy than residents, and AI assistance significantly improved residents’ accuracy. 
The residents’ diagnostic time significantly decreased with AI assistance (median, 972 s; IQR 570–1320; 95% CI 675–1200) versus without (median, 1920 s; IQR 1320–2640; 95% CI 1710–2370). For diagnostically difficult critical illness cases, DeepSeek-R1 generates high-quality information, achieves reasonable diagnostic accuracy, and significantly improves residents’ diagnostic accuracy and efficiency. These findings suggest that reasoning models are promising diagnostic adjuncts in intensive care units.
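The binomial proportions above are reported with 95% confidence intervals (e.g., the AI model's top-diagnosis accuracy of 29/48, CI 0.456–0.729). The abstract does not state which interval method was used; as a rough sanity check, a Wilson score interval — a common default for small samples — gives values in the same range, though not identical:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# Top-diagnosis accuracy of the AI model: 29 correct out of 48 cases
lo, hi = wilson_ci(29, 48)
print(f"29/48 = {29 / 48:.3f}, 95% Wilson CI ({lo:.3f}, {hi:.3f})")
```

This yields roughly (0.463, 0.730), close to but not matching the reported (0.456, 0.729), so the authors likely used a different interval method (e.g., Clopper–Pearson or Jeffreys); the reported values should be taken as authoritative.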
Source journal: Critical Care (Medicine — Critical Care Medicine)
CiteScore: 20.60
Self-citation rate: 3.30%
Articles per year: 348
Review time: 1.5 months
Journal description: Critical Care is an esteemed international medical journal that undergoes a rigorous peer-review process to maintain its high quality standards. Its primary objective is to enhance the healthcare services offered to critically ill patients. To achieve this, the journal focuses on gathering, exchanging, disseminating, and endorsing evidence-based information that is highly relevant to intensivists. By doing so, Critical Care seeks to provide a thorough and inclusive examination of the intensive care field.