Ophthalmological Question Answering and Reasoning Using OpenAI o1 vs Other Large Language Models

Impact Factor 9.2 · CAS Tier 1 (Medicine) · JCR Q1 (Ophthalmology)
Sahana Srinivasan, Xuguang Ai, Minjie Zou, Ke Zou, Hyunjae Kim, Thaddaeus Wai Soon Lo, Krithi Pushpanathan, Gabriel Dawei Yang, Jocelyn Hui Lin Goh, Yiming Kong, Anran Li, Maxwell B. Singer, Kai Jin, Fares Antaki, David Ziyou Chen, Dianbo Liu, Ron A. Adelman, Qingyu Chen, Yih Chung Tham
{"title":"Ophthalmological Question Answering and Reasoning Using OpenAI o1 vs Other Large Language Models","authors":"Sahana Srinivasan, Xuguang Ai, Minjie Zou, Ke Zou, Hyunjae Kim, Thaddaeus Wai Soon Lo, Krithi Pushpanathan, Gabriel Dawei Yang, Jocelyn Hui Lin Goh, Yiming Kong, Anran Li, Maxwell B. Singer, Kai Jin, Fares Antaki, David Ziyou Chen, Dianbo Liu, Ron A. Adelman, Qingyu Chen, Yih Chung Tham","doi":"10.1001/jamaophthalmol.2025.2413","DOIUrl":null,"url":null,"abstract":"ImportanceOpenAI’s recent large language model (LLM) o1 has dedicated reasoning capabilities, but it remains untested in specialized medical fields like ophthalmology. Evaluating o1 in ophthalmology is crucial to determine whether its general reasoning can meet specialized needs or if domain-specific LLMs are warranted.ObjectiveTo assess the performance and reasoning ability of OpenAI’s o1 compared with other LLMs on ophthalmological questions.Design, Setting, and ParticipantsIn September through October 2024, the LLMs o1, GPT-4o (OpenAI), GPT-4 (OpenAI), GPT-3.5 (OpenAI), Llama 3-8B (Meta), and Gemini 1.5 Pro (Google) were evaluated on 6990 standardized ophthalmology questions from the Medical Multiple-Choice Question Answering (MedMCQA) dataset. The study did not analyze human participants.Main Outcomes and MeasuresModels were evaluated on performance (accuracy and macro F1 score) and reasoning abilities (text-generation metrics: Recall-Oriented Understudy for Gisting Evaluation [ROUGE-L], BERTScore, BARTScore, AlignScore, and Metric for Evaluation of Translation With Explicit Ordering [METEOR]). Mean scores are reported for o1, while mean differences (Δ) from o1’s scores are reported for other models. Expert qualitative evaluation of o1 and GPT-4o responses assessed usefulness, organization, and comprehensibility using 5-point Likert scales.ResultsThe LLM o1 achieved the highest accuracy (mean, 0.877; 95% CI, 0.870 to 0.885) and macro F1 score (mean, 0.877; 95% CI, 0.869 to 0.884) (<jats:italic>P</jats:italic> &amp;amp;lt; .001). In BERTScore, GPT-4o (Δ = 0.012; 95% CI, 0.012 to 0.013) and GPT-4 (Δ = 0.014; 95% CI, 0.014 to 0.015) outperformed o1 (<jats:italic>P</jats:italic> &amp;amp;lt; .001). Similarly, in AlignScore, GPT-4o (Δ = 0.019; 95% CI, 0.016 to 0.021) and GPT-4 (Δ = 0.024; 95% CI, 0.021 to 0.026) again performed better (<jats:italic>P</jats:italic> &amp;amp;lt; .001). In ROUGE-L, GPT-4o (Δ = 0.018; 95% CI, 0.017 to 0.019), GPT-4 (Δ = 0.026; 95% CI, 0.025 to 0.027), and GPT-3.5 (Δ = 0.008; 95% CI, 0.007 to 0.009) all outperformed o1 (<jats:italic>P</jats:italic> &amp;amp;lt; .001). Conversely, o1 led in BARTScore (mean, –4.787; 95% CI, –4.813 to –4.762; <jats:italic>P</jats:italic> &amp;amp;lt; .001) and METEOR (mean, 0.221; 95% CI, 0.218 to 0.223; <jats:italic>P</jats:italic> &amp;amp;lt; .001 except GPT-4o). Also, o1 outperformed GPT-4o in usefulness (o1: mean, 4.81; 95% CI, 4.73 to 4.89; GPT-4o: mean, 4.53; 95% CI, 4.40 to 4.65; <jats:italic>P</jats:italic> &amp;amp;lt; .001) and organization (o1: mean, 4.83; 95% CI, 4.75 to 4.90; GPT-4o: mean, 4.63; 95% CI, 4.51 to 4.74; <jats:italic>P</jats:italic> = .003).Conclusions and RelevanceThis study found that o1 excelled in accuracy but showed inconsistencies in text-generation metrics, trailing GPT-4o and GPT-4; expert reviews found o1’s responses to be more clinically useful and better organized than GPT-4o. 
While o1 demonstrated promise, its performance in addressing ophthalmology-specific challenges is not fully optimal, underscoring the potential need for domain-specialized LLMs and targeted evaluations.","PeriodicalId":14518,"journal":{"name":"JAMA ophthalmology","volume":"27 1","pages":""},"PeriodicalIF":9.2000,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JAMA ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1001/jamaophthalmol.2025.2413","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}

Abstract

Importance: OpenAI’s recent large language model (LLM) o1 has dedicated reasoning capabilities, but it remains untested in specialized medical fields like ophthalmology. Evaluating o1 in ophthalmology is crucial to determine whether its general reasoning can meet specialized needs or if domain-specific LLMs are warranted.

Objective: To assess the performance and reasoning ability of OpenAI’s o1 compared with other LLMs on ophthalmological questions.

Design, Setting, and Participants: In September through October 2024, the LLMs o1, GPT-4o (OpenAI), GPT-4 (OpenAI), GPT-3.5 (OpenAI), Llama 3-8B (Meta), and Gemini 1.5 Pro (Google) were evaluated on 6990 standardized ophthalmology questions from the Medical Multiple-Choice Question Answering (MedMCQA) dataset. The study did not analyze human participants.

Main Outcomes and Measures: Models were evaluated on performance (accuracy and macro F1 score) and reasoning abilities (text-generation metrics: Recall-Oriented Understudy for Gisting Evaluation [ROUGE-L], BERTScore, BARTScore, AlignScore, and Metric for Evaluation of Translation With Explicit Ordering [METEOR]). Mean scores are reported for o1, while mean differences (Δ) from o1’s scores are reported for the other models. Expert qualitative evaluation of o1 and GPT-4o responses assessed usefulness, organization, and comprehensibility using 5-point Likert scales.

Results: The LLM o1 achieved the highest accuracy (mean, 0.877; 95% CI, 0.870 to 0.885) and macro F1 score (mean, 0.877; 95% CI, 0.869 to 0.884) (P < .001). In BERTScore, GPT-4o (Δ = 0.012; 95% CI, 0.012 to 0.013) and GPT-4 (Δ = 0.014; 95% CI, 0.014 to 0.015) outperformed o1 (P < .001). Similarly, in AlignScore, GPT-4o (Δ = 0.019; 95% CI, 0.016 to 0.021) and GPT-4 (Δ = 0.024; 95% CI, 0.021 to 0.026) again performed better (P < .001). In ROUGE-L, GPT-4o (Δ = 0.018; 95% CI, 0.017 to 0.019), GPT-4 (Δ = 0.026; 95% CI, 0.025 to 0.027), and GPT-3.5 (Δ = 0.008; 95% CI, 0.007 to 0.009) all outperformed o1 (P < .001). Conversely, o1 led in BARTScore (mean, −4.787; 95% CI, −4.813 to −4.762; P < .001) and METEOR (mean, 0.221; 95% CI, 0.218 to 0.223; P < .001 except GPT-4o). Also, o1 outperformed GPT-4o in usefulness (o1: mean, 4.81; 95% CI, 4.73 to 4.89; GPT-4o: mean, 4.53; 95% CI, 4.40 to 4.65; P < .001) and organization (o1: mean, 4.83; 95% CI, 4.75 to 4.90; GPT-4o: mean, 4.63; 95% CI, 4.51 to 4.74; P = .003).

Conclusions and Relevance: This study found that o1 excelled in accuracy but showed inconsistencies in text-generation metrics, trailing GPT-4o and GPT-4; expert reviews found o1’s responses to be more clinically useful and better organized than GPT-4o’s. While o1 demonstrated promise, its performance in addressing ophthalmology-specific challenges is not fully optimal, underscoring the potential need for domain-specialized LLMs and targeted evaluations.
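As a rough illustration of the evaluation pipeline described above, the sketch below shows how the performance metrics (accuracy, macro F1) and one text-generation metric (ROUGE-L) might be computed with standard Python libraries (scikit-learn and Google's rouge-score package). This is not the authors' code; the abstract does not specify their implementation, and the variable names and toy data here are hypothetical.

```python
# Hypothetical sketch of the metric computations described in the abstract.
# Assumes: pip install scikit-learn rouge-score
from sklearn.metrics import accuracy_score, f1_score
from rouge_score import rouge_scorer

# Toy stand-ins for the MedMCQA ophthalmology subset (the study used 6990 items).
gold_choices = ["A", "C", "B", "D"]   # ground-truth answer keys
model_choices = ["A", "C", "D", "B"]  # one model's selected options

# Performance metrics: accuracy and macro-averaged F1 over the choice labels.
accuracy = accuracy_score(gold_choices, model_choices)
macro_f1 = f1_score(gold_choices, model_choices, average="macro")

# Reasoning (text-generation) metric: ROUGE-L between a model's explanation
# and a reference explanation; in practice this is averaged over all items.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Timolol lowers intraocular pressure by reducing aqueous humor production."
generated = "Timolol reduces aqueous humor production, lowering intraocular pressure."
rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure

print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1:.3f}  rougeL={rouge_l:.3f}")
```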
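The abstract reports each comparator model's result as a mean difference (Δ) from o1 with a 95% CI, but does not say how those intervals were obtained. Purely for illustration, the sketch below assumes a percentile bootstrap over per-question score differences; the function and data are hypothetical.

```python
# Hypothetical percentile-bootstrap CI for the mean difference (Δ) between a
# comparator model's per-question scores and o1's. Assumes: pip install numpy
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_delta_ci(scores_model, scores_o1, n_boot=10_000, alpha=0.05):
    """Mean difference and (1 - alpha) percentile-bootstrap CI."""
    diffs = np.asarray(scores_model) - np.asarray(scores_o1)
    # Resample the per-question differences with replacement, take bootstrap means.
    boots = rng.choice(diffs, size=(n_boot, diffs.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return diffs.mean(), (lo, hi)

# Toy per-question BERTScore values for two models (illustrative only).
o1_scores = rng.normal(0.85, 0.05, size=500)
comparator = o1_scores + rng.normal(0.012, 0.03, size=500)  # slightly higher
delta, (ci_lo, ci_hi) = bootstrap_delta_ci(comparator, o1_scores)
print(f"Δ = {delta:.3f} (95% CI, {ci_lo:.3f} to {ci_hi:.3f})")
```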
Source Journal

JAMA Ophthalmology
CiteScore: 13.20
Self-citation rate: 3.70%
Annual publication volume: 340
Journal introduction: JAMA Ophthalmology, with a rich history of continuous publication since 1869, stands as a distinguished international, peer-reviewed journal dedicated to ophthalmology and visual science. In 2019, the journal proudly commemorated 150 years of uninterrupted service to the field. As a member of the esteemed JAMA Network, a consortium renowned for its peer-reviewed general medical and specialty publications, JAMA Ophthalmology upholds the highest standards of excellence in disseminating cutting-edge research and insights. Join us in celebrating our legacy and advancing the frontiers of ophthalmology and visual science.