Systematic evaluation of the DeepSeek large language model for clinical diagnostic reasoning.

IF 2.6 · Zone 3 (Multidisciplinary Journals) · JCR Q1 (MULTIDISCIPLINARY SCIENCES)
PLoS ONE Pub Date : 2026-05-08 eCollection Date: 2026-01-01 DOI:10.1371/journal.pone.0346078
Yang Wang, Yang He, Xuchang Qin, Yucai Hong, Lin Chen, Jing Zhang, Hongying Ni, Zhongheng Zhang
PLoS ONE, volume 21, issue 5, e0346078. Publication type: Journal Article.
Citation count: 0

Abstract

Background: Artificial intelligence (AI) is in a period of transformative advancement, particularly through the emergence of Transformer-based large language models (LLMs). While these systems demonstrate strong reasoning and generalization capabilities, their clinical applicability, particularly in emergency and critical care decision-making, remains underexplored. In time-sensitive settings, diagnostic reasoning must align rigorously with evidence-based standards and remain timely enough to inform clinical decisions.

Objective: This study aims to provide a preliminary evaluation of the decision-support performance of the DeepSeek model in acute medical scenarios. We systematically evaluate its diagnostic reasoning, temporal consistency of recommendations, and adherence to evidence-based critical care protocols using standardized case-based assessments.

Methods: Twenty-nine representative clinical cases were extracted from the Merck Manual of Diagnosis and Therapy, a widely used medical reference providing standardized case descriptions. The model's outputs were evaluated across four decision-making dimensions: differential diagnosis, diagnostic testing, final diagnosis, and management planning. Human raters scored each response for accuracy, and multivariable linear regression was applied to assess associations between performance and case parameters (age, gender, and Rapid Emergency Medicine Score [REMS]).
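The abstract names multivariable linear regression but gives no implementation details. As a minimal sketch only, assuming an ordinary least squares fit of per-case accuracy on age, gender (coded 0/1), and REMS, the coefficient estimation could look like the following. The helper names `solve_linear_system` and `fit_ols`, the variable coding, and every number below are hypothetical placeholders, not the authors' code or data, and the sketch omits the significance testing a real analysis would need.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations X'X b = X'y."""
    x = [[1.0] + list(r) for r in rows]  # prepend intercept column
    k = len(x[0])
    xtx = [[sum(xi[p] * xi[q] for xi in x) for q in range(k)] for p in range(k)]
    xty = [sum(xi[p] * yi for xi, yi in zip(x, y)) for p in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic placeholder cases (age, gender coded 0/1, REMS), NOT study data.
cases = [(25, 0, 2), (40, 1, 5), (63, 0, 9), (71, 1, 12), (34, 1, 3), (58, 0, 7)]
# Accuracy is generated from known coefficients so the fit can be checked.
true_coefs = [0.9, -0.001, 0.02, -0.005]  # intercept, age, gender, REMS
accuracy = [true_coefs[0] + sum(c * v for c, v in zip(true_coefs[1:], case))
            for case in cases]

coefs = fit_ols(cases, accuracy)
print([round(c, 4) for c in coefs])
```

Because the synthetic outcome is an exact linear function of the predictors, the fitted coefficients should match `true_coefs` up to floating-point error; with real ratings the fit would include residual noise.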

Results: DeepSeek achieved an overall mean accuracy of 82.9% (95% CI: 80.2-85.6%) across all cases. Accuracy peaked in final diagnosis (97.7%), but declined in differential diagnosis (73.0%). Model performance showed no significant variation across demographic or severity strata.
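The abstract does not say how the 95% CI was derived. As a minimal sketch, assuming a normal-approximation interval around the mean of independent per-response accuracy scores, the calculation might look like this. The function name `mean_ci` and the example scores are hypothetical placeholders, not data from the study.

```python
import math
import statistics


def mean_ci(scores, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation 95% CI.

    Assumption: scores are independent accuracy values in [0, 1].
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem


# Hypothetical placeholder scores, NOT data from the study.
example_scores = [1.0, 0.75, 0.5, 1.0, 0.75, 1.0, 0.5, 1.0]
mean, lower, upper = mean_ci(example_scores)
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With small samples a t-based interval would be wider than this z-based one; which approach the authors used is not stated in the abstract.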

Conclusions: DeepSeek shows promising performance in structured case-based diagnostic tasks, particularly in confirmatory diagnostic reasoning. However, its early-stage reasoning and handling of ambiguous cases require enhancement. Future studies using larger and more diverse clinical datasets are needed to further evaluate the model's robustness and potential clinical applicability.

Source journal
PLoS ONE (category: Biology)
CiteScore: 6.20
Self-citation rate: 5.40%
Articles published: 14242
Review time: 3.7 months
About the journal: PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage