Comparative analysis of large language models in providing patient information about keratoconus and contact lenses.

IF 1.4 · CAS Tier 4 (Medicine) · JCR Q3, OPHTHALMOLOGY
Yavuz Kemal Aribas, Atike Burcin Tefon Aribas
{"title":"Comparative analysis of large language models in providing patient information about keratoconus and contact lenses.","authors":"Yavuz Kemal Aribas, Atike Burcin Tefon Aribas","doi":"10.1007/s10792-025-03711-2","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>To evaluate the accuracy, completeness, informational quality, and readability of responses generated by large language models (LLMs)-ChatGPT (OpenAI, USA), Gemini (Google, USA), and Copilot (Microsoft, USA)-to patient questions concerning keratoconus and contact lens use.</p><p><strong>Methods: </strong>In this cross-sectional study, 32 questions across eight domains were posed to the free versions of each model. Two independent ophthalmologists rated accuracy (6-point Likert scale) and completeness (3-point Likert scale). Information quality was assessed using the DISCERN instrument, and readability was evaluated with the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was measured with Cohen's Kappa.</p><p><strong>Results: </strong>Inter-rater reliability showed at least fair agreement for all LLMs. (min κ = 0.365) ChatGPT achieved significantly higher accuracy than Gemini (p < 0.001) and Copilot (p = 0.010), and higher completeness than Gemini (p = 0.001) but was similar to Copilot (p = 0.101). DISCERN scores were highest for ChatGPT (64), followed by Copilot (61) and Gemini (55). All models produced difficult-to-read content (FRES: Gemini 49.7, Copilot 45.4, ChatGPT 40.7), with FKGL values at late high school level.</p><p><strong>Conclusion: </strong>All evaluated large language models were capable of providing generally accurate and thorough information regarding keratoconus and contact lens use. Nevertheless, limitations in readability across models highlight the importance of clinician oversight to ensure that patient education remains clear, accessible, and appropriately tailored to individual needs.</p>","PeriodicalId":14473,"journal":{"name":"International Ophthalmology","volume":"45 1","pages":"340"},"PeriodicalIF":1.4000,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s10792-025-03711-2","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Objective: To evaluate the accuracy, completeness, informational quality, and readability of responses generated by three large language models (LLMs), ChatGPT (OpenAI, USA), Gemini (Google, USA), and Copilot (Microsoft, USA), to patient questions concerning keratoconus and contact lens use.

Methods: In this cross-sectional study, 32 questions across eight domains were posed to the free versions of each model. Two independent ophthalmologists rated accuracy (6-point Likert scale) and completeness (3-point Likert scale). Information quality was assessed using the DISCERN instrument, and readability was evaluated with the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was measured with Cohen's Kappa.
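
For reference, both readability indices are closed-form functions of average sentence length and average syllables per word: FRES = 206.835 - 1.015 × (words/sentence) - 84.6 × (syllables/word), and FKGL = 0.39 × (words/sentence) + 11.8 × (syllables/word) - 15.59. The Python sketch below computes both; its vowel-run syllable counter is a crude heuristic of our own (the study does not name its tooling), so scores should be treated as approximate.

```python
# Sketch of Flesch Reading Ease (FRES) and Flesch-Kincaid Grade Level (FKGL).
# The syllable counter is a rough vowel-group heuristic (an assumption;
# validated readability tools count syllables more carefully).
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels, with a minimum of one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    wps = n_words / sentences   # average words per sentence
    spw = syllables / n_words   # average syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl

fres, fkgl = readability("Keratoconus thins the cornea. Rigid lenses may help.")
print(f"FRES = {fres:.1f}, FKGL = {fkgl:.1f}")
```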

Results: Inter-rater reliability showed at least fair agreement for all LLMs (minimum κ = 0.365). ChatGPT achieved significantly higher accuracy than Gemini (p < 0.001) and Copilot (p = 0.010), and higher completeness than Gemini (p = 0.001) but was similar to Copilot (p = 0.101). DISCERN scores were highest for ChatGPT (64), followed by Copilot (61) and Gemini (55). All models produced difficult-to-read content (FRES: Gemini 49.7, Copilot 45.4, ChatGPT 40.7), with FKGL values at a late high school level.
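
For context, the reported minimum κ = 0.365 sits in the "fair" band (0.21-0.40) of the Landis and Koch scale commonly used to interpret kappa. A minimal sketch of Cohen's kappa, κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement implied by each rater's marginal frequencies, is shown below; the example ratings are hypothetical, not the study's data.

```python
# Minimal Cohen's kappa sketch for two raters' labels on the same items.
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal label frequencies, summed.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 6-point accuracy ratings for ten answers (illustrative only):
a = [6, 5, 6, 4, 5, 6, 3, 5, 4, 6]
b = [6, 5, 5, 4, 5, 6, 4, 5, 4, 5]
print(f"kappa = {cohens_kappa(a, b):.3f}")  # ≈ 0.577, "moderate" agreement
```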

Conclusion: All evaluated large language models were capable of providing generally accurate and thorough information regarding keratoconus and contact lens use. Nevertheless, limitations in readability across models highlight the importance of clinician oversight to ensure that patient education remains clear, accessible, and appropriately tailored to individual needs.

Source journal: International Ophthalmology (CiteScore 3.20; self-citation rate 0.00%; 451 articles published)

Journal description: International Ophthalmology provides the clinician with articles on all the relevant subspecialties of ophthalmology, with a broad international scope. The emphasis is on presentation of the latest clinical research in the field. In addition, the journal includes regular sections devoted to new developments in technologies, products, and techniques.