Bridging gaps in ophthalmology education through large language models

AJO International. Pub Date: 2025-08-23. DOI: 10.1016/j.ajoint.2025.100166

Shahrzad Gholami, Beth Wilson, Sarah Page, Daniel B. Mummert, Joseph Carr, Robert R. McNabb, Rahul Dodhia, Juan M. Lavista Ferres, William B. Weeks, Dale E. Fajardo, Karine D. Bojikian

Volume 2, Issue 4, Article 100166. Full text: https://www.sciencedirect.com/science/article/pii/S295025352500070X


Abstract

Purpose

To assess the performance of general-domain large language models (LLMs), particularly OpenAI’s Generative Pre-trained Transformer (GPT) models, within the American Academy of Ophthalmology (AAO) Self-Assessment Program, which is based on AAO’s Basic and Clinical Science Course.

Methods

We submitted 3357 questions to GPT-4o, GPT-4-Turbo, o1, and o3-mini via Microsoft’s Azure OpenAI Service using zero-shot and chain-of-thought (CoT) prompting. Questions with images were analyzed using the multimodal versions of GPT-4o and GPT-4.1. The performance of the LLMs was compared with that of 1371 unique residents who had previously participated in the program. Additionally, we compared performance on a subset of 1399 questions that included information on 3 question types: recall, interpretation, and decision-making or clinical management. Average accuracy rates were used to evaluate performance, and differences across categories were tested for statistical significance.
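The difference between the two prompting styles can be sketched as follows. This is a minimal illustration only: the abstract does not give the actual prompt wording, so the `MultipleChoiceQuestion` type, the `build_prompt` helper, and both instruction strings are hypothetical, and no call to the Azure OpenAI Service is shown.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceQuestion:
    """Hypothetical container for one multiple-choice question."""
    stem: str
    options: dict[str, str]  # option label -> option text, e.g. {"A": "..."}


def build_prompt(question: MultipleChoiceQuestion, chain_of_thought: bool = False) -> str:
    """Build a zero-shot prompt, optionally with a chain-of-thought instruction.

    The instruction wording is an assumption for illustration; it is not
    the wording used in the study.
    """
    lines = [question.stem, ""]
    for label in sorted(question.options):
        lines.append(f"{label}. {question.options[label]}")
    lines.append("")
    if chain_of_thought:
        # CoT: ask for intermediate reasoning before the final choice.
        lines.append(
            "Think through the problem step by step, "
            "then give the final answer as 'Answer: <letter>'."
        )
    else:
        # Zero-shot: ask for the answer directly, with no worked examples.
        lines.append("Reply with only the letter of the best answer.")
    return "\n".join(lines)


if __name__ == "__main__":
    example = MultipleChoiceQuestion(
        stem="Placeholder question stem",
        options={"A": "first option", "B": "second option"},
    )
    print(build_prompt(example, chain_of_thought=True))
```

In both modes no solved examples are included in the prompt; the CoT variant only adds an instruction to reason before answering.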

Results

o1 (CoT) was the most accurate model (95% confidence interval [CI]: 90.3%–92.1%), with performance ranging from 95.17% (general medicine) to 86.9% (cornea) and 91.1% accuracy on a synthesized sample test. It also outperformed residents on recall-type, interpretation-type, and decision-making or clinical management questions (95.7%, 85.3%, and 90.8%, respectively, P < 0.001). Third-year residents (78.2%) were more accurate than first-year (68.3%) or second-year (74.9%) residents. On multimodal inputs, adding images improved the models’ accuracy, but all models still underperformed compared to residents.
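For readers who want to see how an accuracy rate and its 95% CI can be derived from raw counts, here is a minimal sketch. The abstract does not say which interval method the authors used; the Wilson score interval below is an assumption, the function name `accuracy_with_wilson_ci` is hypothetical, and the counts in the usage example are made up for illustration rather than taken from the study.

```python
import math


def accuracy_with_wilson_ci(correct: int, total: int, z: float = 1.96):
    """Return (accuracy, lower, upper) using the Wilson score interval.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return p, center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts, not data from the study.
    acc, lo, hi = accuracy_with_wilson_ci(correct=90, total=100)
    print(f"accuracy={acc:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

The Wilson interval is used here because it stays within [0, 1] even when accuracy is close to 100%, which a simple normal approximation does not guarantee.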

Conclusions

The accuracy of LLMs continues to improve, with o1 (CoT) showing the highest overall performance. Multimodal inputs can enhance model accuracy, but current models still need improvement. LLMs show great potential for democratizing access to high-quality medical knowledge.

