Open-Source Large Language Models Distilled DeepSeek-R1 Pose Challenges for On-Premises Clinical Deployment in Medical Diagnosis: A Comparative Study of Performance.

IF 5.7 | CAS Zone 3 (Medicine) | JCR Q1 HEALTH CARE SCIENCES & SERVICES

Journal of Medical Systems | Pub Date: 2026-05-01 | DOI: 10.1007/s10916-026-02390-5

Wei Zhong, Yiyao Fu, Dingchuan Peng, Yifan Liu, Yan Liu, Kai Yang, Huimin Gao, Huihui Yan, Wenjing Hao, Yousheng Yan, Chenghong Yin

Citations: 0

Abstract

The open-source reasoning large language model DeepSeek-R1 is increasingly being used in hospitals, but its multiple parameter versions, especially the distilled models, have not been fully evaluated for diagnostic performance. To address this, paired comparisons were conducted between five DeepSeek-R1 models and their respective base models. The models were tested on a diagnostic dataset of 110 simulated clinical cases from open-access data, covering internal medicine, surgery, neurology, gynecology, and pediatrics, and categorized by incidence (frequent, less frequent, rare). Each model was tasked with generating five preliminary diagnoses based on clinical symptoms, and a response was considered correct if the accurate diagnosis was among the five generated. The model pairings were DeepSeek-R1-8B vs. Llama3.1-8B, DeepSeek-R1-14B vs. Qwen2.5-14B, DeepSeek-R1-32B vs. Qwen2.5-32B, DeepSeek-R1-70B vs. Llama3.3-70B, and DeepSeek-R1-671B vs. DeepSeek-V3. All reasoning models except DeepSeek-R1-671B were distilled versions. Diagnostic accuracy was compared using McNemar's test for discordant pairs, with a significance threshold of 0.01. DeepSeek-R1-671B significantly outperformed DeepSeek-V3 (95.45% vs. 88.18%; p = 0.008), while DeepSeek-R1-8B underperformed relative to Llama3.1-8B (47.27% vs. 64.54%; p = 0.003). No significant differences were observed for the mid-sized models. Subgroup analyses based on incidence and clinical specialty further supported these conclusions. Qualitative analysis of the chain-of-thought outputs in incorrect cases revealed three error modes prevalent across all distilled models: reasoning drift, red-flag recognition failure, and diagnostic priority inversion. The study concludes that DeepSeek-R1-671B shows potential for medical diagnosis, but the distilled models do not outperform their base models.
Based on simulated clinical cases, our results do not support deploying distilled models for text-based diagnostic tasks without further validation on real patient data.
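
The evaluation described in the abstract (top-five correctness per case, then McNemar's test on the paired results) can be sketched as follows. This is a minimal illustration, not the authors' code. The function names are hypothetical, the exact-string match in `top5_hit` is a simplifying assumption (the abstract does not say how diagnoses were matched), and the abstract does not state whether the exact or the chi-square form of McNemar's test was used. The per-case results below are constructed only to reproduce the reported totals of 105/110 and 97/110; the abstract does not report the actual discordant counts.

```python
from math import comb


def top5_hit(predicted, reference):
    """True if the reference diagnosis is among the first five predictions.

    Assumption: case-insensitive exact string match; a real study would
    likely need synonym handling or clinician review.
    """
    norm = lambda s: s.strip().lower()
    return norm(reference) in {norm(p) for p in predicted[:5]}


def discordant_counts(hits_a, hits_b):
    """Count cases where exactly one of the two paired models was correct."""
    b = sum(1 for a, h in zip(hits_a, hits_b) if a and not h)  # only A correct
    c = sum(1 for a, h in zip(hits_a, hits_b) if h and not a)  # only B correct
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value (binomial test on discordant pairs)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-case outcomes matching only the reported accuracies.
hits_reasoning = [True] * 105 + [False] * 5   # 105/110 = 95.45%
hits_base = [True] * 97 + [False] * 13        # 97/110 = 88.18%

b, c = discordant_counts(hits_reasoning, hits_base)
p = mcnemar_exact(b, c)
print(b, c, f"{p:.4f}")  # prints: 8 0 0.0078
```

Under this assumed split (eight cases solved only by the reasoning model, none only by the base model), the exact test gives p ≈ 0.008, below the study's 0.01 threshold. Other splits with the same totals would give different p-values, which is why the discordant counts, not the accuracies alone, determine the result.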

Source journal

Journal of Medical Systems (Medicine: Health Care)

CiteScore

11.60

Self-citation rate

1.90%

Articles published

Review time

4.8 months

About the journal: Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital, clinic, and physician's office administration; pathology, radiology, and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles, essays, and studies across the entire scale of medical systems, from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences, and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems, the journal includes a special section devoted to status reports on current installations.