OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board–Style Questions

Impact Factor: 3.2 | Q1 (Ophthalmology)
Ryan Shean BA, Tathya Shah BS, Sina Sobhani BS, Alan Tang BS, Ali Setayesh BA, Kyle Bolo MD, Van Nguyen MD, Benjamin Xu MD, PhD
{"title":"OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board–Style Questions","authors":"Ryan Shean BA ,&nbsp;Tathya Shah BS ,&nbsp;Sina Sobhani BS ,&nbsp;Alan Tang BS ,&nbsp;Ali Setayesh BA ,&nbsp;Kyle Bolo MD ,&nbsp;Van Nguyen MD ,&nbsp;Benjamin Xu MD, PhD","doi":"10.1016/j.xops.2025.100844","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>To evaluate and compare the performance of human test takers and three artificial intelligence (AI) models—OpenAI o1, ChatGPT-4o, and Gemini 1.5 Flash—on ophthalmology board–style questions, focusing on overall accuracy and performance stratified by ophthalmic subspecialty and cognitive complexity level.</div></div><div><h3>Design</h3><div>A cross-sectional study.</div></div><div><h3>Subjects</h3><div>Five hundred questions sourced from the <em>Basic and Clinical Science Course (BCSC)</em> and <em>EyeQuiz</em> question banks.</div></div><div><h3>Methods</h3><div>Three large language models interpreted the questions using standardized prompting procedures. Subanalysis was performed, stratifying the questions by subspecialty and complexity defined by the Buckwalter taxonomic schema. Statistical analysis, including the analysis of variance and McNemar test, was conducted to assess performance differences.</div></div><div><h3>Main Outcome Measures</h3><div>Accuracy of responses for each model and human test takers, stratified by subspecialty and cognitive complexity.</div></div><div><h3>Results</h3><div>OpenAI o1 achieved the highest overall accuracy (423/500, 84.6%), significantly outperforming GPT-4o (331/500, 66.2%; <em>P</em> &lt; 0.001) and Gemini (301/500, 60.2%; <em>P</em> &lt; 0.001). o1 demonstrated superior performance on both <em>BCSC</em> (228/250, 91.2%) and <em>EyeQuiz</em> (195/250, 78.0%) questions compared with GPT-4o (<em>BCSC</em>: 183/250, 73.2%; <em>EyeQuiz</em>: 148/250, 59.2%) and Gemini (<em>BCSC</em>: 163/250, 65.2%; <em>EyeQuiz</em>: 137/250, 54.8%). On <em>BCSC</em> questions, human performance was lower (64.5%) than Gemini 1.5 Flash (65.2%), GPT-4o (73.2%), and OpenAI o1 (91.2%) (<em>P</em> &lt; 0.001). OpenAI o1 outperformed other models in each of the nine ophthalmic subfields and three cognitive complexity levels.</div></div><div><h3>Conclusions</h3><div>OpenAI o1 outperformed GPT-4o, Gemini, and human test takers in answering ophthalmology board–style questions from two question banks and across three complexity levels. These findings highlight advances in AI technology and OpenAI o1’s growing potential as an adjunct in ophthalmic education and care.</div></div><div><h3>Financial Disclosure(s)</h3><div>The author(s) have no proprietary or commercial interest in any materials discussed in this article.</div></div>","PeriodicalId":74363,"journal":{"name":"Ophthalmology science","volume":"5 6","pages":"Article 100844"},"PeriodicalIF":3.2000,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ophthalmology science","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666914525001423","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose

To evaluate and compare the performance of human test takers and three artificial intelligence (AI) models (OpenAI o1, GPT-4o, and Gemini 1.5 Flash) on ophthalmology board–style questions, focusing on overall accuracy and on performance stratified by ophthalmic subspecialty and cognitive complexity level.

Design

A cross-sectional study.

Subjects

Five hundred questions sourced from the Basic and Clinical Science Course (BCSC) and EyeQuiz question banks.

Methods

Three large language models answered the questions using standardized prompting procedures. Subanalyses were performed, stratifying the questions by subspecialty and by cognitive complexity as defined by the Buckwalter taxonomic schema. Statistical analyses, including analysis of variance (ANOVA) and the McNemar test, were conducted to assess performance differences.
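
The abstract does not include the authors' analysis code. As an illustration only, the sketch below shows how an accuracy comparison between two models could be run with an exact McNemar test on paired per-question correctness. The NumPy and statsmodels calls are standard, but the arrays o1_correct and gpt4o_correct are hypothetical placeholders, not the study's data.

# Minimal sketch (not the authors' code): exact McNemar test on paired
# per-question correctness for two models. Requires numpy and statsmodels.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical correctness flags for 500 questions (True = answered correctly);
# a real analysis would score each model's responses against the answer key.
rng = np.random.default_rng(0)
o1_correct = rng.random(500) < 0.846     # placeholder near o1's reported accuracy
gpt4o_correct = rng.random(500) < 0.662  # placeholder near GPT-4o's reported accuracy

# 2x2 paired-outcome table: rows = o1 correct/incorrect,
# columns = GPT-4o correct/incorrect.
table = np.array([
    [np.sum(o1_correct & gpt4o_correct),  np.sum(o1_correct & ~gpt4o_correct)],
    [np.sum(~o1_correct & gpt4o_correct), np.sum(~o1_correct & ~gpt4o_correct)],
])

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"o1 accuracy: {o1_correct.mean():.1%}, GPT-4o accuracy: {gpt4o_correct.mean():.1%}")
print(f"McNemar p-value: {result.pvalue:.3g}")

A paired test such as McNemar's is the natural choice here because both models answer the same 500 questions, so their per-question outcomes are paired rather than independent.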

Main Outcome Measures

Accuracy of responses for each model and human test takers, stratified by subspecialty and cognitive complexity.

Results

OpenAI o1 achieved the highest overall accuracy (423/500, 84.6%), significantly outperforming GPT-4o (331/500, 66.2%; P < 0.001) and Gemini (301/500, 60.2%; P < 0.001). o1 demonstrated superior performance on both BCSC (228/250, 91.2%) and EyeQuiz (195/250, 78.0%) questions compared with GPT-4o (BCSC: 183/250, 73.2%; EyeQuiz: 148/250, 59.2%) and Gemini (BCSC: 163/250, 65.2%; EyeQuiz: 137/250, 54.8%). On BCSC questions, human performance was lower (64.5%) than Gemini 1.5 Flash (65.2%), GPT-4o (73.2%), and OpenAI o1 (91.2%) (P < 0.001). OpenAI o1 outperformed other models in each of the nine ophthalmic subfields and three cognitive complexity levels.

Conclusions

OpenAI o1 outperformed GPT-4o, Gemini, and human test takers in answering ophthalmology board–style questions from two question banks and across three complexity levels. These findings highlight advances in AI technology and OpenAI o1’s growing potential as an adjunct in ophthalmic education and care.

Financial Disclosure(s)

The author(s) have no proprietary or commercial interest in any materials discussed in this article.