A large language model improves clinicians’ diagnostic performance in complex critical illness cases

Impact Factor 8.8 · CAS Tier 1 (Medicine) · JCR Q1, Critical Care Medicine
Xintong Wu, Yu Huang, Qing He
Journal: Critical Care · Published: 2025-06-06 · Journal Article
DOI: 10.1186/s13054-025-05468-7 (https://doi.org/10.1186/s13054-025-05468-7)
Citations: 0

Abstract

Large language models (LLMs) have demonstrated potential in assisting clinical decision-making. However, studies evaluating LLMs’ diagnostic performance on complex critical illness cases are lacking. We aimed to assess the diagnostic accuracy and response quality of an artificial intelligence (AI) model, and evaluate its potential benefits in assisting critical care residents with differential diagnosis of complex cases. This prospective comparative study collected challenging critical illness cases from the literature. Critical care residents from tertiary teaching hospitals were recruited and randomly assigned to non-AI-assisted physician and AI-assisted physician groups. We selected a reasoning model, DeepSeek-R1, for our study. We evaluated the model’s response quality using Likert scales, and we compared the diagnostic accuracy and efficiency between groups. A total of 48 cases were included. Thirty-two critical care residents were recruited, with 16 residents assigned to each group. Each resident handled an average of 3 cases. DeepSeek-R1’s responses received median Likert grades of 4.0 (IQR 4.0–5.0; 95% CI 4.0–4.5) for completeness, 5.0 (IQR 4.0–5.0; 95% CI 4.5–5.0) for clarity, and 5.0 (IQR 4.0–5.0; 95% CI 4.0–5.0) for usefulness. The AI model’s top diagnosis accuracy was 60% (29/48; 95% CI 0.456–0.729), with a median differential diagnosis quality score of 5.0 (IQR 4.0–5.0; 95% CI 4.5–5.0). Top diagnosis accuracy was 27% (13/48; 95% CI 0.146–0.396) in the non-AI-assisted physician group versus 58% (28/48; 95% CI 0.438–0.729) in the AI-assisted physician group. Median differential quality scores were 3.0 (IQR 0–5.0; 95% CI 2.0–4.0) without and 5.0 (IQR 3.0–5.0; 95% CI 3.0–5.0) with AI assistance. The AI model showed higher diagnostic accuracy than residents, and AI assistance significantly improved residents’ accuracy. 
The residents’ diagnostic time significantly decreased with AI assistance (median, 972 s; IQR 570–1320; 95% CI 675–1200) versus without (median, 1920 s; IQR 1320–2640; 95% CI 1710–2370). For diagnostically difficult critical illness cases, DeepSeek-R1 generates high-quality information, achieves reasonable diagnostic accuracy, and significantly improves residents’ diagnostic accuracy and efficiency. These findings suggest that reasoning models are promising diagnostic adjuncts in intensive care units.
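The binomial proportions above are reported with 95% confidence intervals (e.g., the AI model's top-diagnosis accuracy of 29/48, CI 0.456–0.729). The abstract does not state which interval method was used; as a rough sanity check, a Wilson score interval — a common default for small samples — gives values in the same range, though not identical:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# Top-diagnosis accuracy of the AI model: 29 correct out of 48 cases
lo, hi = wilson_ci(29, 48)
print(f"29/48 = {29 / 48:.3f}, 95% Wilson CI ({lo:.3f}, {hi:.3f})")
```

This yields roughly (0.463, 0.730), close to but not matching the reported (0.456, 0.729), so the authors likely used a different interval method (e.g., Clopper–Pearson or Jeffreys); the reported values should be taken as authoritative.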
Source journal: Critical Care (Medicine — Critical Care Medicine)
CiteScore: 20.60
Self-citation rate: 3.30%
Articles per year: 348
Review time: 1.5 months
Journal description: Critical Care is an esteemed international medical journal that undergoes a rigorous peer-review process to maintain its high quality standards. Its primary objective is to enhance the healthcare services offered to critically ill patients. To achieve this, the journal focuses on gathering, exchanging, disseminating, and endorsing evidence-based information that is highly relevant to intensivists. By doing so, Critical Care seeks to provide a thorough and inclusive examination of the intensive care field.