Krithi Pushpanathan MSc, Minjie Zou MMed, Sahana Srinivasan BEng, Wendy Meihua Wong MMed, Erlangga Ariadarma Mangunkusumo MD, George Naveen Thomas MMed, Yien Lai MMed, Chen-Hsin Sun MD, Janice Sing Harn Lam MMed, Marcus Chun Jin Tan MMed, Hazel Anne Hui'En Lin MMed, Weizhi Ma PhD, Victor Teck Chang Koh MMed, David Ziyou Chen MMed, Yih-Chung Tham PhD
{"title":"OpenAI 的新型 o1 模型能否在常见的眼科护理查询中胜过其前辈?","authors":"Krithi Pushpanathan MSc , Minjie Zou MMed , Sahana Srinivasan BEng , Wendy Meihua Wong MMed , Erlangga Ariadarma Mangunkusumo MD , George Naveen Thomas MMed , Yien Lai MMed , Chen-Hsin Sun MD , Janice Sing Harn Lam MMed , Marcus Chun Jin Tan MMed , Hazel Anne Hui'En Lin MMed , Weizhi Ma PhD , Victor Teck Chang Koh MMed , David Ziyou Chen MMed , Yih-Chung Tham PhD","doi":"10.1016/j.xops.2025.100745","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><div>The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher quality responses to eye care queries. However, its performance remains unassessed. We evaluated the performance of o1, ChatGPT-4o, and ChatGPT-4 in addressing ophthalmic-related queries, focusing on correctness, completeness, and readability.</div></div><div><h3>Design</h3><div>Cross-sectional study.</div></div><div><h3>Subjects</h3><div>Sixteen queries, previously identified as suboptimally responded to by ChatGPT-4 from prior studies, were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions).</div></div><div><h3>Methods</h3><div>For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses based on correctness, completeness, and readability (on a 5-point scale for each metric).</div></div><div><h3>Main Outcome Measures</h3><div>Mean summed scores of each model for correctness, completeness, and readability, rated on a 5-point scale (maximum score: 15).</div></div><div><h3>Results</h3><div>O1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (<em>P</em> = 0.010) and 12.4 (<em>P</em> < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopics, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score of 12.4, followed by o1 (10.8), though the difference was not statistically significant. o1 showed notable limitations in completeness for ocular symptom queries, scoring 5.5 out of 15.</div></div><div><h3>Conclusions</h3><div>While o1 is marketed as offering improved reasoning capabilities, its performance in addressing eye care queries does not significantly differ from its predecessor, ChatGPT-4o. 
Nevertheless, it surpasses ChatGPT-4, particularly in correctness and readability.</div></div><div><h3>Financial Disclosure(s)</h3><div>Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.</div></div>","PeriodicalId":74363,"journal":{"name":"Ophthalmology science","volume":"5 4","pages":"Article 100745"},"PeriodicalIF":3.2000,"publicationDate":"2025-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Can OpenAI's New o1 Model Outperform Its Predecessors in Common Eye Care Queries?\",\"authors\":\"Krithi Pushpanathan MSc , Minjie Zou MMed , Sahana Srinivasan BEng , Wendy Meihua Wong MMed , Erlangga Ariadarma Mangunkusumo MD , George Naveen Thomas MMed , Yien Lai MMed , Chen-Hsin Sun MD , Janice Sing Harn Lam MMed , Marcus Chun Jin Tan MMed , Hazel Anne Hui'En Lin MMed , Weizhi Ma PhD , Victor Teck Chang Koh MMed , David Ziyou Chen MMed , Yih-Chung Tham PhD\",\"doi\":\"10.1016/j.xops.2025.100745\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objective</h3><div>The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher quality responses to eye care queries. However, its performance remains unassessed. We evaluated the performance of o1, ChatGPT-4o, and ChatGPT-4 in addressing ophthalmic-related queries, focusing on correctness, completeness, and readability.</div></div><div><h3>Design</h3><div>Cross-sectional study.</div></div><div><h3>Subjects</h3><div>Sixteen queries, previously identified as suboptimally responded to by ChatGPT-4 from prior studies, were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions).</div></div><div><h3>Methods</h3><div>For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses based on correctness, completeness, and readability (on a 5-point scale for each metric).</div></div><div><h3>Main Outcome Measures</h3><div>Mean summed scores of each model for correctness, completeness, and readability, rated on a 5-point scale (maximum score: 15).</div></div><div><h3>Results</h3><div>O1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (<em>P</em> = 0.010) and 12.4 (<em>P</em> < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopics, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score of 12.4, followed by o1 (10.8), though the difference was not statistically significant. o1 showed notable limitations in completeness for ocular symptom queries, scoring 5.5 out of 15.</div></div><div><h3>Conclusions</h3><div>While o1 is marketed as offering improved reasoning capabilities, its performance in addressing eye care queries does not significantly differ from its predecessor, ChatGPT-4o. 
Nevertheless, it surpasses ChatGPT-4, particularly in correctness and readability.</div></div><div><h3>Financial Disclosure(s)</h3><div>Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.</div></div>\",\"PeriodicalId\":74363,\"journal\":{\"name\":\"Ophthalmology science\",\"volume\":\"5 4\",\"pages\":\"Article 100745\"},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2025-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ophthalmology science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666914525000430\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"OPHTHALMOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ophthalmology science","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666914525000430","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Can OpenAI's New o1 Model Outperform Its Predecessors in Common Eye Care Queries?
Objective
The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher-quality responses to eye care queries. However, its performance on such queries has not been assessed. We evaluated the performance of o1, ChatGPT-4o, and ChatGPT-4 in addressing ophthalmic-related queries, focusing on correctness, completeness, and readability.
Design
Cross-sectional study.
Subjects
Sixteen queries, previously identified in prior studies as suboptimally answered by ChatGPT-4, were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions).
Methods
For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses based on correctness, completeness, and readability (on a 5-point scale for each metric).
Main Outcome Measures
Mean summed scores of each model for correctness, completeness, and readability, with each of the 3 graders rating on a 5-point scale (maximum summed score per metric: 15).
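For illustration, below is a minimal sketch (in Python) of the scoring scheme described above, assuming each of the 3 graders rates a response on a 5-point scale per metric and the 3 ratings are summed, giving a maximum of 15; the example ratings are hypothetical and not drawn from the study's data.

```python
def summed_score(ratings):
    """Sum one metric's ratings (each 1-5) from the 3 graders; maximum 15."""
    assert len(ratings) == 3 and all(1 <= r <= 5 for r in ratings)
    return sum(ratings)

def mean_summed_score(per_query_ratings):
    """Mean of the summed scores across all queries for one model and metric."""
    totals = [summed_score(r) for r in per_query_ratings]
    return sum(totals) / len(totals)

# Hypothetical example: two queries, each rated for correctness by 3 graders.
correctness = [[4, 5, 4], [3, 4, 4]]
print(mean_summed_score(correctness))  # -> 12.0 (out of a possible 15)
```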
Results
o1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (P = 0.010) and 12.4 (P < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopics, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score of 12.4, followed by o1 (10.8), though the difference was not statistically significant. o1 showed notable limitations in completeness for ocular symptom queries, scoring 5.5 out of 15.
Conclusions
While o1 is marketed as offering improved reasoning capabilities, its performance in addressing eye care queries does not significantly differ from its predecessor, ChatGPT-4o. Nevertheless, it surpasses ChatGPT-4, particularly in correctness and readability.
Financial Disclosure(s)
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.