Comparing Performance of Large Language Model-Based Tools on Patient-Driven Glaucoma Inquiries.

Impact factor 1.8; Zone 4 (Medicine); JCR Q2 (Ophthalmology)
Dhruva Gupta, Sarah L Wagner, Alexandra G Castillejos Ellenthal, Andrew W Gross, Edward S Lu, Enchi K Chang, Arya S Rao, Marc D Succi
Journal of Glaucoma. Journal Article. Published 2025-09-10. DOI: 10.1097/IJG.0000000000002627 (https://doi.org/10.1097/IJG.0000000000002627)
Citation count: 0

Abstract

Purpose: Large language models (LLMs) can assist patients who seek medical knowledge online to guide their own glaucoma care. Understanding the differences in LLM performance on glaucoma-related questions can inform patients about the best resources to obtain relevant information.

Methods: This cross-sectional study evaluated the accuracy, comprehensiveness, quality, and readability of LLM-generated responses to glaucoma inquiries. Seven questions posted by patients on the American Academy of Ophthalmology's Eye Care Forum were randomly selected and entered as prompts into GPT-4o, GPT-4o Mini, Gemini Pro, and Gemini Flash in September 2024. Four physicians practicing ophthalmology rated each response on a Likert scale for accuracy, comprehensiveness, and quality. The Flesch-Kincaid Grade Level measured readability, while Bidirectional Encoder Representations from Transformers (BERT) Scores measured semantic similarity between LLM responses. Statistical analysis involved either the Kruskal-Wallis test with Dunn's post-hoc test or ANOVA with Tukey's Honestly Significant Difference (HSD) test.
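As an illustration of the readability measure named above, the following is a minimal sketch of the standard Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The function names and the vowel-group syllable heuristic are assumptions made for this example; the abstract does not say which tool or syllable counter the authors used, so scores from this sketch may differ from theirs.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not the study's method):
    count runs of consecutive vowels and drop a likely silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level of an English text."""
    # Sentences are split on terminal punctuation; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    # Hypothetical response text, written only for this example.
    sample = ("Glaucoma damages the optic nerve. "
              "Eye drops can lower the pressure inside the eye.")
    print(round(flesch_kincaid_grade(sample), 2))
```

Higher values correspond to text that needs more years of schooling to read; the heuristic syllable counter is the main source of error in a sketch like this.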

Results: GPT-4o rated higher than Gemini Pro in accuracy (P=0.016), comprehensiveness (P=0.007), and quality (P=0.002). GPT-4o Mini also rated higher than Gemini Pro in comprehensiveness (P=0.011) and quality (P=0.007). Gemini Flash and Gemini Pro were similar across all criteria. There were no differences in readability, and LLMs mostly produced semantically similar responses.
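The group comparisons behind P values like those above rest on rank-based statistics. As a rough sketch of the Kruskal-Wallis step only (Dunn's post-hoc test is not shown), the code below computes the tie-corrected H statistic, which is then compared against a chi-square distribution with k − 1 degrees of freedom for k groups. The function names and the Likert ratings are hypothetical, invented for illustration; they are not the study's data or code.

```python
from collections import Counter
from itertools import chain


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over tied-value counts t.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


if __name__ == "__main__":
    # Hypothetical 1-5 Likert ratings for four tools (made-up numbers).
    ratings = [
        [5, 4, 5, 4, 5, 4, 5],
        [4, 4, 5, 4, 4, 5, 4],
        [3, 3, 4, 2, 3, 4, 3],
        [3, 4, 3, 3, 4, 3, 3],
    ]
    print(round(kruskal_wallis_h(ratings), 3))
```

Because Likert ratings produce many ties, the tie correction matters here; without it H would be understated.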

Conclusions: GPT models surpass Gemini Pro in addressing commonly asked questions about glaucoma, offering valuable insight into the use of LLMs for delivering health information.

Source journal: Journal of Glaucoma (Medicine, Ophthalmology)
CiteScore: 4.20
Self-citation rate: 10.00%
Articles published: 330
Review time: 4-8 weeks
Journal description: The Journal of Glaucoma is a peer-reviewed journal addressing the spectrum of issues affecting the definition, diagnosis, and management of glaucoma and providing a forum for lively and stimulating discussion of clinical, scientific, and socioeconomic factors affecting the care of glaucoma patients.