Yang Wang, Yang He, Xuchang Qin, Yucai Hong, Lin Chen, Jing Zhang, Hongying Ni, Zhongheng Zhang
Systematic evaluation of the DeepSeek large language model for clinical diagnostic reasoning
PLoS ONE 21(5): e0346078. DOI: 10.1371/journal.pone.0346078. Published 2026-05-08 (eCollection).
Citations: 0
Abstract
Background: Artificial intelligence (AI) is undergoing an era of transformative advancement, particularly through the emergence of Transformer-based large language models (LLMs). While these systems demonstrate strong reasoning and generalization capabilities, their clinical applicability, particularly in emergency and critical care decision-making, remains underexplored. In time-sensitive settings, diagnostic reasoning must align rigorously with evidence-based standards, and recommendations must be delivered on a timescale relevant to clinical decisions.
Objective: This study aims to provide a preliminary evaluation of the decision-support performance of the DeepSeek model in acute medical scenarios. We systematically evaluate its diagnostic reasoning, temporal consistency of recommendations, and adherence to evidence-based critical care protocols using standardized case-based assessments.
Methods: Twenty-nine representative clinical cases were extracted from the Merck Manual of Diagnosis and Therapy, a widely used medical reference providing standardized case descriptions. The model's outputs were evaluated across four decision-making dimensions: differential diagnosis, diagnostic testing, final diagnosis, and management planning. Human raters scored each response for accuracy, and multivariable linear regression was applied to assess associations between performance and case parameters (age, gender, and Rapid Emergency Medicine Score [REMS]).
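The multivariable analysis described above can be sketched as follows. This is an illustrative reconstruction only: the per-case scores and covariate values below are synthetic, and the exact coding of gender and model specification used by the authors is assumed, not reported here.

```python
import numpy as np

# Synthetic stand-ins for the study's 29 cases; none of these values
# come from the paper -- they exist only to show the regression setup.
rng = np.random.default_rng(0)
n = 29
age = rng.integers(18, 90, n).astype(float)
gender = rng.integers(0, 2, n).astype(float)  # 0/1 coding assumed
rems = rng.integers(0, 20, n).astype(float)   # Rapid Emergency Medicine Score
accuracy = 83 + rng.normal(0, 5, n)           # outcome: per-case accuracy (%)

# Ordinary least squares via the normal equations (design matrix with intercept)
X = np.column_stack([np.ones(n), age, gender, rems])
coef, *_ = np.linalg.lstsq(X, accuracy, rcond=None)
print(dict(zip(["intercept", "age", "gender", "rems"], coef.round(3))))
```

A full analysis would also report standard errors and p-values for each coefficient (e.g. via `statsmodels`), which is how the abstract's "no significant variation" claim would be tested.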
Results: DeepSeek achieved an overall mean accuracy of 82.9% (95% CI: 80.2-85.6%) across all cases. Accuracy peaked in final diagnosis (97.7%), but declined in differential diagnosis (73.0%). Model performance showed no significant variation across demographic or severity strata.
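For readers unfamiliar with the interval notation, a 95% CI like "82.9% (80.2-85.6%)" is typically derived from the per-case scores as mean ± 1.96 standard errors (normal approximation). The scores below are synthetic examples, not the study's data.

```python
import math
import statistics

# Illustrative per-case accuracy scores (%); synthetic, not from the paper.
scores = [78.5, 84.0, 91.2, 80.1, 76.4, 88.3, 85.0, 82.7, 79.9, 86.6]

mean = statistics.mean(scores)
se = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
lo, hi = mean - 1.96 * se, mean + 1.96 * se             # normal-approximation 95% CI
print(f"{mean:.1f}% (95% CI: {lo:.1f}-{hi:.1f}%)")
```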
Conclusions: DeepSeek shows promising performance in structured case-based diagnostic tasks, particularly in confirmatory diagnostic reasoning. However, its early-stage reasoning and handling of ambiguous cases require enhancement. Future studies using larger and more diverse clinical datasets are needed to further evaluate the model's robustness and potential clinical applicability.
About the journal:
PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online; authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage