Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning

Mickael Tordjman, Zelong Liu, Murat Yuce, Valentin Fauveau, Yunhao Mei, Jerome Hadjadj, Ian Bolger, Haidara Almansour, Carolyn Horst, Ashwin Singh Parihar, Amine Geahchan, Anis Meribout, Nader Yatim, Nicole Ng, Phillip Robson, Alexander Zhou, Sara Lewis, Mingqian Huang, Timothy Deyer, Bachir Taouli, Hao-Chih Lee, Zahi A. Fayad, Xueyan Mei

Nature Medicine (published 2025-04-23). DOI: 10.1038/s41591-025-03726-3
Abstract
DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. This study assessed the capabilities of three LLMs—DeepSeek-R1, ChatGPT-o1, and Llama 3.1-405B—in performing four different medical tasks: answering questions from the United States Medical Licensing Examination (USMLE), interpreting and reasoning based on text-based diagnostic and management cases, providing tumor classification according to RECIST 1.1 criteria, and providing summaries of diagnostic imaging reports across multiple modalities. In the USMLE test, the performance of DeepSeek-R1 (accuracy = 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy = 0.95; p = 0.04) but better than that of Llama 3.1-405B (accuracy = 0.83; p < 10⁻³). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy of 0.57 vs 0.55; p = 0.76 and 0.74 vs 0.76; p = 0.06, using the New England Journal of Medicine and Medicilline databases, respectively). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.73 vs 0.81; p = 0.10). Diagnostic reasoning steps provided by DeepSeek-R1 were deemed more accurate than those provided by ChatGPT-o1 and Llama 3.1-405B (average Likert scores of 3.61, 3.22, and 3.13, respectively; p = 0.005 and p < 10⁻³). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 vs 4.8; p < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also underlines areas needing improvement.
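The abstract reports p-values for paired accuracy comparisons (e.g., DeepSeek-R1 vs ChatGPT-o1 on the same USMLE questions). As context, below is a minimal sketch of how such comparisons are commonly made with McNemar's test on per-question correct/incorrect outcomes. The choice of test, the sample size, and the synthetic correctness vectors are illustrative assumptions, not the authors' published code or data.

```python
# A minimal sketch (assumption: McNemar's test on paired per-question
# outcomes, a standard choice for comparing two models on one benchmark).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-question correctness for two models on the same
# question set (1 = correct, 0 = incorrect); synthetic data for illustration.
rng = np.random.default_rng(0)
model_a = rng.integers(0, 2, size=325)  # stand-in for model A's answers
model_b = rng.integers(0, 2, size=325)  # stand-in for model B's answers

# 2x2 contingency table: rows = model A correct/incorrect,
# columns = model B correct/incorrect.
table = np.array([
    [np.sum((model_a == 1) & (model_b == 1)),
     np.sum((model_a == 1) & (model_b == 0))],
    [np.sum((model_a == 0) & (model_b == 1)),
     np.sum((model_a == 0) & (model_b == 0))],
])

# exact=True runs an exact binomial test on the discordant pairs.
result = mcnemar(table, exact=True)
print(f"accuracy A = {model_a.mean():.2f}, accuracy B = {model_b.mean():.2f}")
print(f"McNemar p-value = {result.pvalue:.4f}")
```

With an exact McNemar test, only the discordant questions (those one model answers correctly and the other does not) drive the p-value, which is why two models with close headline accuracies, such as 0.57 vs 0.55 above, can still yield a non-significant difference.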
Journal introduction:
Nature Medicine is a monthly journal publishing original peer-reviewed research in all areas of medicine. The publication focuses on originality, timeliness, interdisciplinary interest, and the impact on improving human health. In addition to research articles, Nature Medicine also publishes commissioned content such as News, Reviews, and Perspectives. This content aims to provide context for the latest advances in translational and clinical research, reaching a wide audience of M.D. and Ph.D. readers. All editorial decisions for the journal are made by a team of full-time professional editors.
Nature Medicine considers all types of clinical research, including:
-Case reports and small case series
-Clinical trials, whether phase 1, 2, 3 or 4
-Observational studies
-Meta-analyses
-Biomarker studies
-Public and global health studies
Nature Medicine is also committed to facilitating communication between translational and clinical researchers. As such, we consider “hybrid” studies with preclinical and translational findings reported alongside data from clinical studies.