Medical reasoning in LLMs: an in-depth analysis of DeepSeek R1.

IF 4.7 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Frontiers in Artificial Intelligence · Pub Date: 2025-06-18 · eCollection Date: 2025-01-01 · DOI: 10.3389/frai.2025.1616145
Birger Moëll, Fredrik Sand Aronsson, Sanian Akbar
{"title":"Medical reasoning in LLMs: an in-depth analysis of DeepSeek R1.","authors":"Birger Moëll, Fredrik Sand Aronsson, Sanian Akbar","doi":"10.3389/frai.2025.1616145","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>The integration of large language models (LLMs) into healthcare holds immense promise, but also raises critical challenges, particularly regarding the interpretability and reliability of their reasoning processes. While models like DeepSeek R1-which incorporates explicit reasoning steps-show promise in enhancing performance and explainability, their alignment with domain-specific expert reasoning remains understudied.</p><p><strong>Methods: </strong>This paper evaluates the medical reasoning capabilities of DeepSeek R1, comparing its outputs to the reasoning patterns of medical domain experts.</p><p><strong>Results: </strong>Through qualitative and quantitative analyses of 100 diverse clinical cases from the MedQA dataset, we demonstrate that DeepSeek R1 achieves 93% diagnostic accuracy and shows patterns of medical reasoning. Analysis of the seven error cases revealed several recurring errors: anchoring bias, difficulty integrating conflicting data, limited consideration of alternative diagnoses, overthinking, incomplete knowledge, and prioritizing definitive treatment over crucial intermediate steps.</p><p><strong>Discussion: </strong>These findings highlight areas for improvement in LLM reasoning for medical applications. Notably the length of reasoning was important with longer responses having a higher probability for error. The marked disparity in reasoning length suggests that extended explanations may signal uncertainty or reflect attempts to rationalize incorrect conclusions. Shorter responses (e.g., under 5,000 characters) were strongly associated with accuracy, providing a practical threshold for assessing confidence in model-generated answers. Beyond observed reasoning errors, the LLM demonstrated sound clinical judgment by systematically evaluating patient information, forming a differential diagnosis, and selecting appropriate treatment based on established guidelines, drug efficacy, resistance patterns, and patient-specific factors. This ability to integrate complex information and apply clinical knowledge highlights the potential of LLMs for supporting medical decision-making through artificial medical reasoning.</p>","PeriodicalId":33315,"journal":{"name":"Frontiers in Artificial Intelligence","volume":"8 ","pages":"1616145"},"PeriodicalIF":4.7000,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12213874/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frai.2025.1616145","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Introduction: The integration of large language models (LLMs) into healthcare holds immense promise, but also raises critical challenges, particularly regarding the interpretability and reliability of their reasoning processes. While models like DeepSeek R1, which incorporates explicit reasoning steps, show promise in enhancing performance and explainability, their alignment with domain-specific expert reasoning remains understudied.

Methods: This paper evaluates the medical reasoning capabilities of DeepSeek R1, comparing its outputs to the reasoning patterns of medical domain experts.

Results: Through qualitative and quantitative analyses of 100 diverse clinical cases from the MedQA dataset, we demonstrate that DeepSeek R1 achieves 93% diagnostic accuracy and shows patterns of medical reasoning. Analysis of the seven error cases revealed recurring failure modes: anchoring bias, difficulty integrating conflicting data, limited consideration of alternative diagnoses, overthinking, incomplete knowledge, and prioritizing definitive treatment over crucial intermediate steps.

Discussion: These findings highlight areas for improvement in LLM reasoning for medical applications. Notably, reasoning length mattered: longer responses had a higher probability of error. The marked disparity in reasoning length suggests that extended explanations may signal uncertainty or reflect attempts to rationalize incorrect conclusions. Shorter responses (e.g., under 5,000 characters) were strongly associated with accuracy, providing a practical threshold for assessing confidence in model-generated answers. Beyond the observed reasoning errors, the LLM demonstrated sound clinical judgment by systematically evaluating patient information, forming a differential diagnosis, and selecting appropriate treatment based on established guidelines, drug efficacy, resistance patterns, and patient-specific factors. This ability to integrate complex information and apply clinical knowledge highlights the potential of LLMs for supporting medical decision-making through artificial medical reasoning.
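A minimal sketch of how the reported length heuristic could be applied in practice is shown below; the function name, labels, and overall structure are illustrative assumptions, not code from the paper, and only the ~5,000-character cutoff comes from the Results.

```python
# Minimal sketch: flag long reasoning traces as lower-confidence answers,
# using the ~5,000-character threshold reported in the Results.
# Names and structure are illustrative assumptions, not the authors' code.

LENGTH_THRESHOLD = 5_000  # characters; longer responses were more error-prone


def confidence_flag(reasoning_text: str, threshold: int = LENGTH_THRESHOLD) -> str:
    """Return a coarse confidence label based only on reasoning length."""
    return "higher-confidence" if len(reasoning_text) < threshold else "review-recommended"


if __name__ == "__main__":
    short_trace = "Step 1: classic presentation, consistent labs. Answer: B."
    long_trace = "Reconsidering the alternatives once more... " * 400  # well over 5,000 characters
    print(confidence_flag(short_trace))  # higher-confidence
    print(confidence_flag(long_trace))   # review-recommended
```

Such a length-based flag would only triage model outputs for human review; it does not assess the clinical correctness of the reasoning itself.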

Source journal: Frontiers in Artificial Intelligence
CiteScore: 6.10 · Self-citation rate: 2.50% · Articles per year: 272 · Review time: 13 weeks