Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning

Impact factor: 58.7 · CAS Tier 1 (Medicine) · JCR Q1 · Biochemistry & Molecular Biology
Mickael Tordjman, Zelong Liu, Murat Yuce, Valentin Fauveau, Yunhao Mei, Jerome Hadjadj, Ian Bolger, Haidara Almansour, Carolyn Horst, Ashwin Singh Parihar, Amine Geahchan, Anis Meribout, Nader Yatim, Nicole Ng, Phillip Robson, Alexander Zhou, Sara Lewis, Mingqian Huang, Timothy Deyer, Bachir Taouli, Hao-Chih Lee, Zahi A. Fayad, Xueyan Mei
{"title":"Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning","authors":"Mickael Tordjman, Zelong Liu, Murat Yuce, Valentin Fauveau, Yunhao Mei, Jerome Hadjadj, Ian Bolger, Haidara Almansour, Carolyn Horst, Ashwin Singh Parihar, Amine Geahchan, Anis Meribout, Nader Yatim, Nicole Ng, Phillip Robson, Alexander Zhou, Sara Lewis, Mingqian Huang, Timothy Deyer, Bachir Taouli, Hao-Chih Lee, Zahi A. Fayad, Xueyan Mei","doi":"10.1038/s41591-025-03726-3","DOIUrl":null,"url":null,"abstract":"<p>DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. This study assessed the capabilities of three LLMs— DeepSeek-R1, ChatGPT-o1, and Llama 3.1-405B—in performing four different medical tasks: answering questions from the United States Medical Licensing Examination (USMLE), interpreting and reasoning based on text-based diagnostic and management cases, providing tumor classification according to RECIST 1.1 criteria, and providing summaries of diagnostic imaging reports across multiple modalities. In the USMLE test, the performance of DeepSeek-R1(accuracy=0.92) was slightly inferior to that of ChatGPT-o1(accuracy=0.95; p = 0.04) but better than that of Llama 3.1-405B (accuracy=0.83; p &lt; 10<sup>-3</sup>). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy of 0.57 vs 0.55; p = 0.76 and 0.74 vs 0.76; p = 0.06, using New England Journal of Medicine and Medicilline databases, respectively). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.73 vs 0.81; p = 0.10). Diagnostic reasoning steps provided by DeepSeek were deemed more accurate than those provided by ChatGPT and Llama 3.1-405B (average Likert score of 3.61, 3.22, and 3.13, respectively, p = 0.005 and p &lt; 10<sup>−3</sup>). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 vs 4.8; p &lt; 10<sup>−3</sup>). This study highlights the potential of DeepSeek-R1 LLM for medical applications but also underlines areas needing improvements.</p>","PeriodicalId":19037,"journal":{"name":"Nature Medicine","volume":"54 1","pages":""},"PeriodicalIF":58.7000,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Nature Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1038/s41591-025-03726-3","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. This study assessed the capabilities of three LLMs (DeepSeek-R1, ChatGPT-o1, and Llama 3.1-405B) on four medical tasks: answering questions from the United States Medical Licensing Examination (USMLE), interpreting and reasoning over text-based diagnostic and management cases, classifying tumors according to RECIST 1.1 criteria, and summarizing diagnostic imaging reports across multiple modalities. On the USMLE questions, DeepSeek-R1 (accuracy = 0.92) performed slightly worse than ChatGPT-o1 (accuracy = 0.95; p = 0.04) but better than Llama 3.1-405B (accuracy = 0.83; p < 10⁻³). On the text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy of 0.57 vs 0.55, p = 0.76, and 0.74 vs 0.76, p = 0.06, on the New England Journal of Medicine and Medicilline databases, respectively). For RECIST classification, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.73 vs 0.81; p = 0.10). The diagnostic reasoning steps provided by DeepSeek-R1 were rated more accurate than those provided by ChatGPT-o1 and Llama 3.1-405B (average Likert scores of 3.61, 3.22, and 3.13, respectively; p = 0.005 and p < 10⁻³). However, the summarized imaging reports produced by DeepSeek-R1 exhibited lower global quality than those produced by ChatGPT-o1 (5-point Likert score: 4.5 vs 4.8; p < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also identifies areas needing improvement.
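The abstract reports pairwise accuracy comparisons with p-values but does not state which statistical test was used or release evaluation code. As a minimal, hypothetical sketch, the Python snippet below shows one standard way to make such a comparison when two models answer the same multiple-choice questions: compute each model's accuracy, then apply McNemar's exact test to the paired correct/incorrect outcomes. The `compare_models` helper and the use of `statsmodels` are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only -- not the paper's actual evaluation code.
# Compares two LLMs answering the same multiple-choice questions (e.g., USMLE)
# using per-model accuracy and McNemar's exact test on paired outcomes.
from statsmodels.stats.contingency_tables import mcnemar

def accuracy(preds: list[str], answers: list[str]) -> float:
    """Fraction of questions answered correctly."""
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)

def compare_models(preds_a: list[str], preds_b: list[str], answers: list[str]):
    """Return (accuracy_a, accuracy_b, McNemar p-value) for paired predictions."""
    correct_a = [p == a for p, a in zip(preds_a, answers)]
    correct_b = [p == a for p, a in zip(preds_b, answers)]
    # 2x2 contingency table: rows = model A correct/incorrect,
    # columns = model B correct/incorrect. McNemar's test uses only
    # the discordant cells (one model right, the other wrong).
    table = [[0, 0], [0, 0]]
    for ca, cb in zip(correct_a, correct_b):
        table[int(not ca)][int(not cb)] += 1
    result = mcnemar(table, exact=True)
    return accuracy(preds_a, answers), accuracy(preds_b, answers), result.pvalue

# Toy usage: three questions with answer key A, C, B.
acc_a, acc_b, p = compare_models(["A", "C", "B"], ["A", "D", "B"], ["A", "C", "B"])
```

McNemar's test is a natural fit here because both models see identical questions, so per-question outcomes are paired rather than independent; a test on the discordant pairs is more powerful than comparing the two accuracies as independent proportions.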

Source journal

Nature Medicine (Medicine; Biochemistry & Molecular Biology)
CiteScore: 100.90
Self-citation rate: 0.70%
Articles per year: 525
Review time: 1 month
Journal introduction: Nature Medicine is a monthly journal publishing original peer-reviewed research in all areas of medicine. The publication focuses on originality, timeliness, interdisciplinary interest, and impact on improving human health. In addition to research articles, Nature Medicine also publishes commissioned content such as News, Reviews, and Perspectives. This content aims to provide context for the latest advances in translational and clinical research, reaching a wide audience of M.D. and Ph.D. readers. All editorial decisions for the journal are made by a team of full-time professional editors. Nature Medicine considers all types of clinical research, including:
- Case reports and small case series
- Clinical trials, whether phase 1, 2, 3, or 4
- Observational studies
- Meta-analyses
- Biomarker studies
- Public and global health studies
Nature Medicine is also committed to facilitating communication between translational and clinical researchers. As such, the journal considers "hybrid" studies with preclinical and translational findings reported alongside data from clinical studies.