OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board–Style Questions

Ryan Shean BA, Tathya Shah BS, Sina Sobhani BS, Alan Tang BS, Ali Setayesh BA, Kyle Bolo MD, Van Nguyen MD, Benjamin Xu MD, PhD

Ophthalmology Science, Volume 5, Issue 6, Article 100844. Published June 6, 2025. DOI: 10.1016/j.xops.2025.100844
https://www.sciencedirect.com/science/article/pii/S2666914525001423
Abstract
Purpose
To evaluate and compare the performance of human test takers and three artificial intelligence (AI) models—OpenAI o1, ChatGPT-4o, and Gemini 1.5 Flash—on ophthalmology board–style questions, focusing on overall accuracy and performance stratified by ophthalmic subspecialty and cognitive complexity level.
Design
A cross-sectional study.
Subjects
Five hundred questions sourced from the Basic and Clinical Science Course (BCSC) and EyeQuiz question banks.
Methods
Three large language models interpreted the questions using standardized prompting procedures. Subanalyses were performed, stratifying the questions by subspecialty and by cognitive complexity as defined by the Buckwalter taxonomic schema. Statistical analyses, including analysis of variance and the McNemar test, were conducted to assess performance differences.
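As an illustration of the paired comparison named above, the sketch below applies the McNemar test to per-question correctness vectors for two models. This is a minimal sketch under stated assumptions: the variable names and the randomly generated correctness data are placeholders, not the study's actual graded responses.

```python
# Minimal sketch of a paired accuracy comparison with the McNemar test,
# assuming per-question correctness (True = correct) is available for two
# models on the same 500 questions. Data below are illustrative only.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 500

# Hypothetical correctness vectors (placeholders for real grading results).
model_a_correct = rng.random(n_questions) < 0.85   # a model near 85% accuracy
model_b_correct = rng.random(n_questions) < 0.66   # a model near 66% accuracy

# 2x2 table of paired outcomes: rows = model A correct/incorrect,
# columns = model B correct/incorrect.
table = np.array([
    [np.sum(model_a_correct & model_b_correct),  np.sum(model_a_correct & ~model_b_correct)],
    [np.sum(~model_a_correct & model_b_correct), np.sum(~model_a_correct & ~model_b_correct)],
])

# Exact McNemar test uses only the discordant pairs (off-diagonal cells).
result = mcnemar(table, exact=True)
print(f"Accuracy A: {model_a_correct.mean():.1%}, Accuracy B: {model_b_correct.mean():.1%}")
print(f"McNemar statistic = {result.statistic:.0f}, p-value = {result.pvalue:.3g}")
```

The McNemar test is the natural choice here because the same 500 questions were posed to each model, so correctness outcomes are paired rather than independent samples.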
Main Outcome Measures
Accuracy of responses for each model and human test takers, stratified by subspecialty and cognitive complexity.
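The outcome measure above amounts to accuracy aggregated within each subspecialty and complexity stratum. A minimal sketch of that aggregation, assuming a per-question results table with hypothetical column names, follows; the rows shown are illustrative only.

```python
# Minimal sketch of stratified accuracy, assuming a per-question table with
# hypothetical columns: model, subspecialty, complexity, correct (0/1).
import pandas as pd

# Illustrative rows only; the study graded 500 questions per model.
df = pd.DataFrame({
    "model": ["o1", "o1", "GPT-4o", "GPT-4o"],
    "subspecialty": ["Retina", "Glaucoma", "Retina", "Glaucoma"],
    "complexity": [1, 3, 1, 3],
    "correct": [1, 1, 1, 0],
})

# Accuracy stratified by subspecialty and by cognitive complexity level.
by_subspecialty = df.groupby(["model", "subspecialty"])["correct"].mean()
by_complexity = df.groupby(["model", "complexity"])["correct"].mean()
print(by_subspecialty, by_complexity, sep="\n\n")
```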
Results
OpenAI o1 achieved the highest overall accuracy (423/500, 84.6%), significantly outperforming GPT-4o (331/500, 66.2%; P < 0.001) and Gemini (301/500, 60.2%; P < 0.001). OpenAI o1 also demonstrated superior performance on both BCSC (228/250, 91.2%) and EyeQuiz (195/250, 78.0%) questions compared with GPT-4o (BCSC: 183/250, 73.2%; EyeQuiz: 148/250, 59.2%) and Gemini (BCSC: 163/250, 65.2%; EyeQuiz: 137/250, 54.8%). On BCSC questions, human performance (64.5%) was lower than that of Gemini 1.5 Flash (65.2%), GPT-4o (73.2%), and OpenAI o1 (91.2%) (P < 0.001). OpenAI o1 outperformed the other models in each of the nine ophthalmic subfields and at all three cognitive complexity levels.
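The sketch below recomputes the overall accuracies reported above from the stated counts and adds Wilson 95% confidence intervals for context. The intervals are not reported in the abstract; they are an illustrative addition computed only from the published correct/total counts.

```python
# Recompute reported overall accuracies and add illustrative Wilson 95% CIs.
# Counts are taken from the Results; the CIs themselves are not from the paper.
from statsmodels.stats.proportion import proportion_confint

reported = {
    "OpenAI o1": (423, 500),
    "GPT-4o": (331, 500),
    "Gemini 1.5 Flash": (301, 500),
}

for model, (correct, total) in reported.items():
    acc = correct / total
    lo, hi = proportion_confint(correct, total, alpha=0.05, method="wilson")
    print(f"{model}: {acc:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```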
Conclusions
OpenAI o1 outperformed GPT-4o, Gemini, and human test takers in answering ophthalmology board–style questions from two question banks and across three complexity levels. These findings highlight advances in AI technology and OpenAI o1’s growing potential as an adjunct in ophthalmic education and care.
Financial Disclosure(s)
The author(s) have no proprietary or commercial interest in any materials discussed in this article.