{"title":"Comparative analysis of large language models in providing patient information about keratoconus and contact lenses.","authors":"Yavuz Kemal Aribas, Atike Burcin Tefon Aribas","doi":"10.1007/s10792-025-03711-2","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>To evaluate the accuracy, completeness, informational quality, and readability of responses generated by large language models (LLMs)-ChatGPT (OpenAI, USA), Gemini (Google, USA), and Copilot (Microsoft, USA)-to patient questions concerning keratoconus and contact lens use.</p><p><strong>Methods: </strong>In this cross-sectional study, 32 questions across eight domains were posed to the free versions of each model. Two independent ophthalmologists rated accuracy (6-point Likert scale) and completeness (3-point Likert scale). Information quality was assessed using the DISCERN instrument, and readability was evaluated with the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was measured with Cohen's Kappa.</p><p><strong>Results: </strong>Inter-rater reliability showed at least fair agreement for all LLMs. (min κ = 0.365) ChatGPT achieved significantly higher accuracy than Gemini (p < 0.001) and Copilot (p = 0.010), and higher completeness than Gemini (p = 0.001) but was similar to Copilot (p = 0.101). DISCERN scores were highest for ChatGPT (64), followed by Copilot (61) and Gemini (55). All models produced difficult-to-read content (FRES: Gemini 49.7, Copilot 45.4, ChatGPT 40.7), with FKGL values at late high school level.</p><p><strong>Conclusion: </strong>All evaluated large language models were capable of providing generally accurate and thorough information regarding keratoconus and contact lens use. Nevertheless, limitations in readability across models highlight the importance of clinician oversight to ensure that patient education remains clear, accessible, and appropriately tailored to individual needs.</p>","PeriodicalId":14473,"journal":{"name":"International Ophthalmology","volume":"45 1","pages":"340"},"PeriodicalIF":1.4000,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s10792-025-03711-2","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
Objective: To evaluate the accuracy, completeness, informational quality, and readability of responses to patient questions concerning keratoconus and contact lens use generated by three large language models (LLMs): ChatGPT (OpenAI, USA), Gemini (Google, USA), and Copilot (Microsoft, USA).
Methods: In this cross-sectional study, 32 questions across eight domains were posed to the free versions of each model. Two independent ophthalmologists rated accuracy (6-point Likert scale) and completeness (3-point Likert scale). Information quality was assessed using the DISCERN instrument, and readability was evaluated with the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was measured with Cohen's Kappa.
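For context, the two readability indices named above are fixed linear formulas over sentence, word, and syllable counts: FRES = 206.835 − 1.015 (words/sentence) − 84.6 (syllables/word), and FKGL = 0.39 (words/sentence) + 11.8 (syllables/word) − 15.59. A minimal Python sketch follows; the vowel-group syllable counter is a rough illustrative heuristic, not the tool used in the study.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) for a block of English text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text) or ["a"]  # avoid division by zero
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # mean words per sentence
    spw = syllables / len(words)   # mean syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl
```

On the standard FRES scale, scores of 30-50 read as "difficult" (college level), the band into which all three models' scores reported below fall.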
Results: Inter-rater reliability showed at least fair agreement for all LLMs (minimum κ = 0.365). ChatGPT achieved significantly higher accuracy than Gemini (p < 0.001) and Copilot (p = 0.010), and higher completeness than Gemini (p = 0.001) but similar completeness to Copilot (p = 0.101). DISCERN scores were highest for ChatGPT (64), followed by Copilot (61) and Gemini (55). All models produced difficult-to-read content (FRES: Gemini 49.7, Copilot 45.4, ChatGPT 40.7), with FKGL values at a late high school level.
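Cohen's kappa, used above for inter-rater agreement, corrects the observed agreement p_o for the agreement p_e expected by chance from each rater's marginal rating frequencies: κ = (p_o − p_e) / (1 − p_e). A minimal sketch (the ratings in the usage line are hypothetical, not the study's data):

```python
from collections import Counter

def cohens_kappa(r1: list, r2: list) -> float:
    """Cohen's kappa for two raters scoring the same items."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n    # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 6-point accuracy ratings from two graders:
print(cohens_kappa([5, 6, 4, 5, 6, 3], [5, 6, 5, 5, 6, 4]))
```

On the widely used Landis and Koch scale, 0.21-0.40 counts as "fair" and 0.41-0.60 as "moderate" agreement, so the minimum κ of 0.365 reported above sits in the fair band.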
Conclusion: All evaluated large language models were capable of providing generally accurate and thorough information regarding keratoconus and contact lens use. Nevertheless, limitations in readability across models highlight the importance of clinician oversight to ensure that patient education remains clear, accessible, and appropriately tailored to individual needs.
Journal Introduction:
International Ophthalmology provides the clinician with articles on all the relevant subspecialties of ophthalmology, with a broad international scope. The emphasis is on presentation of the latest clinical research in the field. In addition, the journal includes regular sections devoted to new developments in technologies, products, and techniques.