Comparing Performance of Large Language Model-Based Tools on Patient-Driven Glaucoma Inquiries

Dhruva Gupta, Sarah L Wagner, Alexandra G Castillejos Ellenthal, Andrew W Gross, Edward S Lu, Enchi K Chang, Arya S Rao, Marc D Succi

Journal of Glaucoma, published online 2025-09-10. DOI: 10.1097/IJG.0000000000002627
Abstract
Purpose: Large language models (LLMs) can assist patients seeking medical knowledge online to guide their own glaucoma care. Understanding differences in LLM performance on glaucoma-related questions can point patients toward the best resources for obtaining relevant information.
Methods: This cross-sectional study evaluated the accuracy, comprehensiveness, quality, and readability of LLM-generated responses to glaucoma inquiries. Seven questions posted by patients on the American Academy of Ophthalmology's Eye Care Forum were randomly selected and prompted into GPT-4o, GPT-4o Mini, Gemini Pro, and Gemini Flash in September 2024. Four practicing ophthalmologists assessed responses on Likert scales for accuracy, comprehensiveness, and quality. Flesch-Kincaid Grade Level measured readability, while BERTScore (based on Bidirectional Encoder Representations from Transformers) measured semantic similarity between LLM responses. Statistical analysis used either the Kruskal-Wallis test with Dunn's post-hoc test or analysis of variance (ANOVA) with Tukey's Honestly Significant Difference (HSD) test.
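To make the evaluation pipeline concrete, the following is a minimal illustrative sketch (not the authors' code) of how the metrics and tests named in the Methods could be computed in Python. It assumes the textstat, bert-score, scipy, scikit-posthocs, and statsmodels packages; the response strings and physician ratings are placeholder values invented for illustration only.

```python
# Sketch: readability, semantic similarity, and group comparisons for LLM responses.
# Assumes: pip install textstat bert-score scipy scikit-posthocs statsmodels pandas
import textstat
from bert_score import score as bert_score
from scipy.stats import kruskal, f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import scikit_posthocs as sp
import pandas as pd

# Hypothetical responses from each model to one patient question (placeholders).
responses = {
    "GPT-4o":       "Glaucoma is usually managed with pressure-lowering eye drops...",
    "GPT-4o Mini":  "Treatment typically starts with drops that lower eye pressure...",
    "Gemini Pro":   "Your doctor may prescribe drops, laser treatment, or surgery...",
    "Gemini Flash": "Eye drops are the most common first treatment for glaucoma...",
}

# Readability: Flesch-Kincaid Grade Level for each response.
for model, text in responses.items():
    print(model, textstat.flesch_kincaid_grade(text))

# Semantic similarity: BERTScore F1 between two models' responses.
P, R, F1 = bert_score([responses["GPT-4o"]], [responses["Gemini Pro"]], lang="en")
print("BERTScore F1 (GPT-4o vs Gemini Pro):", float(F1.mean()))

# Hypothetical 1-5 Likert ratings from four physicians for one criterion.
ratings = pd.DataFrame({
    "model": ["GPT-4o"] * 4 + ["GPT-4o Mini"] * 4 + ["Gemini Pro"] * 4 + ["Gemini Flash"] * 4,
    "score": [5, 5, 4, 5, 4, 5, 4, 4, 3, 3, 4, 3, 3, 4, 3, 3],
})
groups = [g["score"].values for _, g in ratings.groupby("model")]

# Non-parametric route: Kruskal-Wallis followed by Dunn's post-hoc test.
print("Kruskal-Wallis:", kruskal(*groups))
print(sp.posthoc_dunn(ratings, val_col="score", group_col="model", p_adjust="bonferroni"))

# Parametric route: one-way ANOVA followed by Tukey's HSD.
print("ANOVA:", f_oneway(*groups))
print(pairwise_tukeyhsd(ratings["score"], ratings["model"]))
```

The choice between the two statistical routes would typically depend on whether the rating distributions meet ANOVA's normality assumptions; the sketch simply shows both, as the abstract reports that either was used.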
Results: GPT-4o was rated higher than Gemini Pro in accuracy (P=0.016), comprehensiveness (P=0.007), and quality (P=0.002). GPT-4o Mini was also rated higher than Gemini Pro in comprehensiveness (P=0.011) and quality (P=0.007). Gemini Flash and Gemini Pro were similar across all criteria. There were no differences in readability, and the LLMs mostly produced semantically similar responses.
Conclusions: GPT models surpassed Gemini Pro in addressing commonly asked questions about glaucoma, offering valuable insight into the use of LLMs for providing health information.
Journal Introduction:
The Journal of Glaucoma is a peer-reviewed journal addressing the spectrum of issues affecting the definition, diagnosis, and management of glaucoma, and providing a forum for lively and stimulating discussion of the clinical, scientific, and socioeconomic factors affecting the care of glaucoma patients.