Yang Wang, Yang He, Xuchang Qin, Yucai Hong, Lin Chen, Jing Zhang, Hongying Ni, Zhongheng Zhang
Systematic evaluation of the DeepSeek large language model for clinical diagnostic reasoning
PLoS ONE 21(5): e0346078. DOI: 10.1371/journal.pone.0346078. Published 2026-05-08 (eCollection).
Citations: 0
Abstract
Background: Artificial intelligence (AI) is undergoing an era of transformative advancement, particularly through the emergence of Transformer-based large language models (LLMs). While these systems demonstrate strong reasoning and generalization capabilities, their clinical applicability, particularly in emergency and critical care decision-making, remains underexplored. In time-sensitive settings, diagnostic reasoning must align rigorously with evidence-based standards, and recommendations must be delivered on a timescale relevant to clinical decisions.
Objective: This study aims to provide a preliminary evaluation of the decision-support performance of the DeepSeek model in acute medical scenarios. We systematically evaluate its diagnostic reasoning, temporal consistency of recommendations, and adherence to evidence-based critical care protocols using standardized case-based assessments.
Methods: Twenty-nine representative clinical cases were extracted from the Merck Manual of Diagnosis and Therapy, a widely used medical reference providing standardized case descriptions. The model's outputs were evaluated across four decision-making dimensions: differential diagnosis, diagnostic testing, final diagnosis, and management planning. Human raters scored each response for accuracy, and multivariable linear regression was applied to assess associations between performance and case parameters (age, gender, and Rapid Emergency Medicine Score [REMS]).
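The multivariable analysis described above can be sketched as follows. This is an illustrative reconstruction only: the per-case scores and covariate values below are synthetic, and the exact coding of gender and model specification used by the authors is assumed, not reported here.

```python
import numpy as np

# Synthetic stand-ins for the study's 29 cases; none of these values
# come from the paper -- they exist only to show the regression setup.
rng = np.random.default_rng(0)
n = 29
age = rng.integers(18, 90, n).astype(float)
gender = rng.integers(0, 2, n).astype(float)  # 0/1 coding assumed
rems = rng.integers(0, 20, n).astype(float)   # Rapid Emergency Medicine Score
accuracy = 83 + rng.normal(0, 5, n)           # outcome: per-case accuracy (%)

# Ordinary least squares via the normal equations (design matrix with intercept)
X = np.column_stack([np.ones(n), age, gender, rems])
coef, *_ = np.linalg.lstsq(X, accuracy, rcond=None)
print(dict(zip(["intercept", "age", "gender", "rems"], coef.round(3))))
```

A full analysis would also report standard errors and p-values for each coefficient (e.g. via `statsmodels`), which is how the abstract's "no significant variation" claim would be tested.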
Results: DeepSeek achieved an overall mean accuracy of 82.9% (95% CI: 80.2-85.6%) across all cases. Accuracy peaked in final diagnosis (97.7%), but declined in differential diagnosis (73.0%). Model performance showed no significant variation across demographic or severity strata.
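For readers unfamiliar with the interval notation, a 95% CI like "82.9% (80.2-85.6%)" is typically derived from the per-case scores as mean ± 1.96 standard errors (normal approximation). The scores below are synthetic examples, not the study's data.

```python
import math
import statistics

# Illustrative per-case accuracy scores (%); synthetic, not from the paper.
scores = [78.5, 84.0, 91.2, 80.1, 76.4, 88.3, 85.0, 82.7, 79.9, 86.6]

mean = statistics.mean(scores)
se = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
lo, hi = mean - 1.96 * se, mean + 1.96 * se             # normal-approximation 95% CI
print(f"{mean:.1f}% (95% CI: {lo:.1f}-{hi:.1f}%)")
```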
Conclusions: DeepSeek shows promising performance in structured case-based diagnostic tasks, particularly in confirmatory diagnostic reasoning. However, its early-stage reasoning and handling of ambiguous cases require enhancement. Future studies using larger and more diverse clinical datasets are needed to further evaluate the model's robustness and potential clinical applicability.
About the journal:
PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online; authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage