DeepSeek-R1 outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in bilingual complex ophthalmology reasoning

Impact Factor: 3.4
Pusheng Xu, Yue Wu, Kai Jin, Xiaolan Chen, Mingguang He, Danli Shi
{"title":"DeepSeek-R1 outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in bilingual complex ophthalmology reasoning","authors":"Pusheng Xu ,&nbsp;Yue Wu ,&nbsp;Kai Jin ,&nbsp;Xiaolan Chen ,&nbsp;Mingguang He ,&nbsp;Danli Shi","doi":"10.1016/j.aopr.2025.05.001","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>To evaluate the accuracy and reasoning ability of DeepSeek-R1 and three recently released large language models (LLMs) in bilingual complex ophthalmology cases.</div></div><div><h3>Methods</h3><div>A total of 130 multiple-choice questions (MCQs) related to diagnosis (n ​= ​39) and management (n ​= ​91) were collected from the Chinese ophthalmology senior professional title examination and categorized into six topics. These MCQs were translated into English. Responses from DeepSeek-R1, Gemini 2.0 Pro, OpenAI o1 and o3-mini were generated under default configurations between February 15 and February 20, 2025. Accuracy was calculated as the proportion of correctly answered questions, with omissions and extra answers considered incorrect. Reasoning ability was evaluated through analyzing reasoning logic and the causes of reasoning errors.</div></div><div><h3>Results</h3><div>DeepSeek-R1 demonstrated the highest overall accuracy, achieving 0.862 in Chinese MCQs and 0.808 in English MCQs. Gemini 2.0 Pro, OpenAI o1, and OpenAI o3-mini attained accuracies of 0.715, 0.685, and 0.692 in Chinese MCQs (all <em>P</em> ​&lt;0.001 compared with DeepSeek-R1), and 0.746 (<em>P</em> ​= ​0.115), 0.723 (<em>P</em> ​= ​0.027), and 0.577 (<em>P</em> ​&lt;0.001) in English MCQs, respectively. DeepSeek-R1 achieved the highest accuracy across five topics in both Chinese and English MCQs. It also excelled in management questions conducted in Chinese (all <em>P</em> ​&lt;0.05). Reasoning ability analysis showed that the four LLMs shared similar reasoning logic. Ignoring key positive history, ignoring key positive signs, misinterpretation of medical data, and overuse of non–first-line interventions were the most common causes of reasoning errors.</div></div><div><h3>Conclusions</h3><div>DeepSeek-R1 demonstrated superior performance in bilingual complex ophthalmology reasoning tasks than three state-of-the-art LLMs. These findings highlight the potential of advanced LLMs to assist in clinical decision-making and suggest a framework for evaluating reasoning capabilities.</div></div>","PeriodicalId":72103,"journal":{"name":"Advances in ophthalmology practice and research","volume":"5 3","pages":"Pages 189-195"},"PeriodicalIF":3.4000,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in ophthalmology practice and research","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2667376225000290","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose

To evaluate the accuracy and reasoning ability of DeepSeek-R1 and three recently released large language models (LLMs) in bilingual complex ophthalmology cases.

Methods

A total of 130 multiple-choice questions (MCQs) related to diagnosis (n = 39) and management (n = 91) were collected from the Chinese ophthalmology senior professional title examination and categorized into six topics. These MCQs were translated into English. Responses from DeepSeek-R1, Gemini 2.0 Pro, OpenAI o1, and OpenAI o3-mini were generated under default configurations between February 15 and February 20, 2025. Accuracy was calculated as the proportion of correctly answered questions, with omissions and extra answers counted as incorrect. Reasoning ability was evaluated by analyzing reasoning logic and the causes of reasoning errors.
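A minimal sketch of the scoring rule described above, assuming each response is recorded as the set of selected options (the `accuracy` helper and the toy answer key below are hypothetical illustrations, not the authors' grading script): a response counts as correct only if it matches the key exactly, so omitted and extra selections are both scored as errors.

```python
# Illustrative only: a minimal implementation of the accuracy rule in Methods.
# Each answer is the set of selected options (e.g. {"A"} or {"B", "D"}).
# A response is correct only when it equals the key exactly, so omissions
# and extra answers are both counted as incorrect.

def accuracy(responses: list[set[str]], answer_key: list[set[str]]) -> float:
    correct = sum(resp == key for resp, key in zip(responses, answer_key, strict=True))
    return correct / len(answer_key)

# Hypothetical 4-question example: Q3 is marked wrong because "D" was omitted.
key = [{"A"}, {"C"}, {"B", "D"}, {"E"}]
model = [{"A"}, {"C"}, {"B"}, {"E"}]
print(accuracy(model, key))  # 0.75
```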

Results

DeepSeek-R1 demonstrated the highest overall accuracy, achieving 0.862 on Chinese MCQs and 0.808 on English MCQs. Gemini 2.0 Pro, OpenAI o1, and OpenAI o3-mini attained accuracies of 0.715, 0.685, and 0.692 on Chinese MCQs (all P < 0.001 compared with DeepSeek-R1), and 0.746 (P = 0.115), 0.723 (P = 0.027), and 0.577 (P < 0.001) on English MCQs, respectively. DeepSeek-R1 achieved the highest accuracy across five topics in both Chinese and English MCQs. It also excelled on management questions in Chinese (all P < 0.05). Reasoning ability analysis showed that the four LLMs shared similar reasoning logic. Ignoring key positive history, ignoring key positive signs, misinterpreting medical data, and overusing non-first-line interventions were the most common causes of reasoning errors.
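The abstract does not name the statistical test behind these P values. As a hedged sketch only: because all four models answered the same 130 MCQs, one natural paired comparison is McNemar's test on per-question correctness; the correctness vectors below are simulated at the reported accuracies purely for illustration and are not the study's data.

```python
# Hedged sketch: the abstract does not state which test produced its P values;
# McNemar's test is one common choice for paired accuracy comparisons on the
# same question set. The correctness vectors here are simulated, not real data.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 130
model_a = rng.random(n_questions) < 0.862  # e.g. reported DeepSeek-R1 accuracy, Chinese MCQs
model_b = rng.random(n_questions) < 0.715  # e.g. reported Gemini 2.0 Pro accuracy, Chinese MCQs

# 2x2 table of paired outcomes: rows = model A correct/incorrect,
# columns = model B correct/incorrect.
table = np.array([
    [np.sum(model_a & model_b), np.sum(model_a & ~model_b)],
    [np.sum(~model_a & model_b), np.sum(~model_a & ~model_b)],
])
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"acc A = {model_a.mean():.3f}, acc B = {model_b.mean():.3f}, P = {result.pvalue:.4f}")
```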

Conclusions

DeepSeek-R1 demonstrated superior performance on bilingual complex ophthalmology reasoning tasks compared with three state-of-the-art LLMs. These findings highlight the potential of advanced LLMs to assist in clinical decision-making and suggest a framework for evaluating reasoning capabilities.