Comparing Performance of Large Language Model-Based Tools on Patient-Driven Glaucoma Inquiries

Dhruva Gupta, Sarah L Wagner, Alexandra G Castillejos Ellenthal, Andrew W Gross, Edward S Lu, Enchi K Chang, Arya S Rao, Marc D Succi

Journal of Glaucoma, published online 2025-09-10. DOI: 10.1097/IJG.0000000000002627
Abstract
Purpose: Large language models (LLMs) can assist patients seeking medical knowledge online to guide their own glaucoma care. Understanding differences in LLM performance on glaucoma-related questions can point patients toward the best resources for obtaining relevant information.
Methods: This cross-sectional study evaluated the accuracy, comprehensiveness, quality, and readability of LLM-generated responses to glaucoma inquiries. Seven questions posted by patients on the American Academy of Ophthalmology's Eye Care Forum were randomly selected and prompted into GPT-4o, GPT-4o Mini, Gemini Pro, and Gemini Flash in September 2024. Four practicing ophthalmologists assessed responses on Likert scales for accuracy, comprehensiveness, and quality. Flesch-Kincaid Grade Level measured readability, while BERTScore (based on Bidirectional Encoder Representations from Transformers) measured semantic similarity between LLM responses. Statistical analysis used either the Kruskal-Wallis test with Dunn's post-hoc test or analysis of variance (ANOVA) with Tukey's Honestly Significant Difference (HSD) test.
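To make the evaluation pipeline concrete, the following is a minimal illustrative sketch (not the authors' code) of how the metrics and tests named in the Methods could be computed in Python. It assumes the textstat, bert-score, scipy, scikit-posthocs, and statsmodels packages; the response strings and physician ratings are placeholder values invented for illustration only.

```python
# Sketch: readability, semantic similarity, and group comparisons for LLM responses.
# Assumes: pip install textstat bert-score scipy scikit-posthocs statsmodels pandas
import textstat
from bert_score import score as bert_score
from scipy.stats import kruskal, f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import scikit_posthocs as sp
import pandas as pd

# Hypothetical responses from each model to one patient question (placeholders).
responses = {
    "GPT-4o":       "Glaucoma is usually managed with pressure-lowering eye drops...",
    "GPT-4o Mini":  "Treatment typically starts with drops that lower eye pressure...",
    "Gemini Pro":   "Your doctor may prescribe drops, laser treatment, or surgery...",
    "Gemini Flash": "Eye drops are the most common first treatment for glaucoma...",
}

# Readability: Flesch-Kincaid Grade Level for each response.
for model, text in responses.items():
    print(model, textstat.flesch_kincaid_grade(text))

# Semantic similarity: BERTScore F1 between two models' responses.
P, R, F1 = bert_score([responses["GPT-4o"]], [responses["Gemini Pro"]], lang="en")
print("BERTScore F1 (GPT-4o vs Gemini Pro):", float(F1.mean()))

# Hypothetical 1-5 Likert ratings from four physicians for one criterion.
ratings = pd.DataFrame({
    "model": ["GPT-4o"] * 4 + ["GPT-4o Mini"] * 4 + ["Gemini Pro"] * 4 + ["Gemini Flash"] * 4,
    "score": [5, 5, 4, 5, 4, 5, 4, 4, 3, 3, 4, 3, 3, 4, 3, 3],
})
groups = [g["score"].values for _, g in ratings.groupby("model")]

# Non-parametric route: Kruskal-Wallis followed by Dunn's post-hoc test.
print("Kruskal-Wallis:", kruskal(*groups))
print(sp.posthoc_dunn(ratings, val_col="score", group_col="model", p_adjust="bonferroni"))

# Parametric route: one-way ANOVA followed by Tukey's HSD.
print("ANOVA:", f_oneway(*groups))
print(pairwise_tukeyhsd(ratings["score"], ratings["model"]))
```

The choice between the two statistical routes would typically depend on whether the rating distributions meet ANOVA's normality assumptions; the sketch simply shows both, as the abstract reports that either was used.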
Results: GPT-4o was rated higher than Gemini Pro in accuracy (P=0.016), comprehensiveness (P=0.007), and quality (P=0.002). GPT-4o Mini was also rated higher than Gemini Pro in comprehensiveness (P=0.011) and quality (P=0.007). Gemini Flash and Gemini Pro were similar across all criteria. There were no differences in readability, and the LLMs mostly produced semantically similar responses.
Conclusions: GPT models surpassed Gemini Pro in addressing commonly asked questions about glaucoma, offering valuable insight into the use of LLMs for providing health information.
Journal Introduction:
The Journal of Glaucoma is a peer-reviewed journal addressing the spectrum of issues affecting the definition, diagnosis, and management of glaucoma, and providing a forum for lively and stimulating discussion of the clinical, scientific, and socioeconomic factors affecting the care of glaucoma patients.