DeepSeek-R1 vs OpenAI o1 for Ophthalmic Diagnoses and Management Plans.

Impact factor: 9.2 · CAS Region 1 (Medicine) · JCR Q1 (Ophthalmology)
David Mikhail, Andrew Farah, Jason Milad, Andrew Mihalache, Daniel Milad, Fares Antaki, Michael Balas, Marko M Popovic, Rajeev H Muni, Pearse A Keane, Renaud Duval
DOI: 10.1001/jamaophthalmol.2025.2918
Published: September 4, 2025 (JAMA Ophthalmology, Journal Article)
Citations: 0

Abstract

Importance: Large language models (LLMs) are increasingly being explored in clinical decision-making, but few studies have evaluated their performance on complex ophthalmology cases from clinical practice settings. Understanding whether open-weight, reasoning-enhanced LLMs can outperform proprietary models has implications for clinical utility and accessibility.

Objective: To evaluate the diagnostic accuracy, management decision-making, and cost of DeepSeek-R1 vs OpenAI o1 across diverse ophthalmic subspecialties.

Design, Setting, and Participants: This was a cross-sectional evaluation conducted using standardized prompts and model configurations. Clinical cases were sourced from JAMA Ophthalmology's Clinical Challenge articles, which contain complex cases from clinical practice settings. Each case included an open-ended diagnostic question and a multiple-choice next-step decision. All cases were included without exclusions, and no human participants were involved. Data were analyzed from March 13 to March 30, 2025.

Exposures: DeepSeek-R1 and OpenAI o1 were evaluated using the Plan-and-Solve Plus (PS+) prompt engineering method.

Main Outcomes and Measures: Primary outcomes were diagnostic accuracy and next-step decision-making accuracy, defined as the proportion of correct responses. Token cost analyses were performed to estimate expenses. Intermodel agreement was evaluated using the Cohen κ, and the McNemar test was used to compare performance.

Results: A total of 422 clinical cases were included, spanning 10 subspecialties. DeepSeek-R1 achieved a higher diagnostic accuracy of 70.4% (297 of 422 cases) compared with 63.0% (266 of 422 cases) for OpenAI o1, a 7.3% difference (95% CI, 1.0%-13.7%; P = .02). For next-step decisions, DeepSeek-R1 was correct in 82.7% of cases (349 of 422) vs 75.8% (320 of 422) for OpenAI o1, a 6.9% difference (95% CI, 1.4%-12.3%; P = .01). Intermodel agreement was moderate (κ = 0.422; 95% CI, 0.375-0.469; P < .001). DeepSeek-R1 offered lower costs per query than OpenAI o1, with savings exceeding 66-fold (up to 98.5%) during off-peak pricing.

Conclusions and Relevance: DeepSeek-R1 outperformed OpenAI o1 in diagnosis and management across subspecialties while lowering operating costs, supporting the potential of open-weight, reinforcement learning-augmented LLMs as scalable and cost-saving tools for clinical decision support. Further investigations should evaluate safety guardrails and assess the performance of self-hosted adaptations of DeepSeek-R1 with domain-specific ophthalmic expertise to optimize clinical utility.
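The headline comparisons in the abstract rest on paired per-case outcomes: each of the 422 cases yields a correct/incorrect result for both models, from which accuracy, Cohen κ, and the McNemar statistic follow. As a minimal sketch with hypothetical data (not the study's case-level results, which are not reported here), the following Python computes those quantities by hand; the McNemar continuity correction is an assumption, as the article does not specify the variant used. It also checks the cost arithmetic: a 66-fold saving corresponds to 1 − 1/66 ≈ 98.5%, matching the figure quoted.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two models' per-case labels (1 = correct, 0 = incorrect)."""
    n = len(a)
    categories = set(a) | set(b)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

def mcnemar_chi2(a, b):
    """McNemar chi-square with continuity correction on paired binary outcomes.
    Only discordant pairs (one model correct, the other not) carry information."""
    d1 = sum(1 for x, y in zip(a, b) if x and not y)  # a correct, b wrong
    d2 = sum(1 for x, y in zip(a, b) if y and not x)  # b correct, a wrong
    return (abs(d1 - d2) - 1) ** 2 / (d1 + d2)

# Hypothetical paired outcomes, for illustration only.
r1 = [1, 1, 0, 0, 1, 1, 1, 0]
o1 = [1, 0, 0, 1, 1, 0, 1, 0]

acc_diff = sum(r1) / len(r1) - sum(o1) / len(o1)
kappa = cohen_kappa(r1, o1)
chi2 = mcnemar_chi2(r1, o1)

# Cost arithmetic from the abstract: a 66-fold saving leaves 1/66 of the cost.
savings = 1 - 1 / 66
print(round(acc_diff, 3), round(kappa, 3), round(chi2, 3), round(savings, 3))
```

In practice one would use library implementations (e.g., an exact McNemar test for small discordant counts), but the hand-rolled versions make explicit that only discordant pairs drive the paired comparison.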
Source journal: JAMA Ophthalmology
CiteScore: 13.20 · Self-citation rate: 3.70% · Annual articles: 340
About the journal: JAMA Ophthalmology, with a rich history of continuous publication since 1869, stands as a distinguished international, peer-reviewed journal dedicated to ophthalmology and visual science. In 2019, the journal commemorated 150 years of uninterrupted service to the field. As a member of the JAMA Network, a consortium renowned for its peer-reviewed general medical and specialty publications, JAMA Ophthalmology upholds the highest standards of excellence in disseminating cutting-edge research and insights.