Evaluating the reliability of the responses of large language models to keratoconus-related questions.

IF 1.5 · JCR Q3 (Ophthalmology) · CAS Tier 4 (Medicine)
Clinical and Experimental Optometry · Pages 784-791 · Pub Date: 2025-09-01 · Epub Date: 2024-10-24 · DOI: 10.1080/08164622.2024.2419524
Mustafa Kayabaşı, Seher Köksaldı, Ceren Durmaz Engin
{"title":"Evaluating the reliability of the responses of large language models to keratoconus-related questions.","authors":"Mustafa Kayabaşı, Seher Köksaldı, Ceren Durmaz Engin","doi":"10.1080/08164622.2024.2419524","DOIUrl":null,"url":null,"abstract":"<p><strong>Clinical relevance: </strong>Artificial intelligence has undergone a rapid evolution and large language models (LLMs) have become promising tools for healthcare, with the ability of providing human-like responses to questions. The capabilities of these tools in addressing questions related to keratoconus (KCN) have not been previously explored.</p><p><strong>Background: </strong>In this study, the responses were evaluated from three LLMs - ChatGPT-4, Copilot, and Gemini - to common patient questions regarding KCN.</p><p><strong>Methods: </strong>Fifty real-life patient inquiries regarding general information, aetiology, symptoms and diagnosis, progression, and treatment of KCN were presented to the LLMs. Evaluations of the answers were conducted by three ophthalmologists with a 5-point Likert scale ranging from 'strongly disagreed' to 'strongly agreed'. The reliability of the responses provided by LLMs was evaluated using the DISCERN and the Ensuring Quality Information for Patients (EQIP) scales. Readability metrics (Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index) were calculated to evaluate the complexity of responses.</p><p><strong>Results: </strong>ChatGPT-4 consistently scored 3 points or higher for all (100%) its responses, while Copilot had five (10%) and Gemini had two (4%) responses scoring 2 points or below. ChatGPT-4 achieved a 'strongly agree' rate of 74% across all questions, markedly superior to Copilot at 34% and Gemini at 42% (<i>p</i> < 0.001); and recorded the highest 'strongly agree' rates in general information and symptoms & diagnosis categories (90% for both). The median Likert scores differed among LLMs (<i>p</i> < 0.001), with ChatGPT-4 scoring highest and Copilot scoring lowest. Although ChatGPT-4 exhibited more reliability based on the DISCERN scale, it was characterised by lower readability and higher complexity. While all LLMs provided responses categorised as 'extremely difficult to read', the responses provided by Copilot showed higher readability.</p><p><strong>Conclusions: </strong>Despite the responses provided by ChatGPT-4 exhibiting lower readability and greater complexity, it emerged as the most proficient in answering KCN-related questions.</p>","PeriodicalId":10214,"journal":{"name":"Clinical and Experimental Optometry","volume":" ","pages":"784-791"},"PeriodicalIF":1.5000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical and Experimental Optometry","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/08164622.2024.2419524","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/10/24 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Clinical relevance: Artificial intelligence has undergone rapid evolution, and large language models (LLMs) have become promising tools for healthcare, with the ability to provide human-like responses to questions. The capabilities of these tools in addressing questions related to keratoconus (KCN) have not been previously explored.

Background: In this study, the responses of three LLMs - ChatGPT-4, Copilot, and Gemini - to common patient questions regarding KCN were evaluated.

Methods: Fifty real-life patient inquiries regarding general information, aetiology, symptoms and diagnosis, progression, and treatment of KCN were presented to the LLMs. Three ophthalmologists evaluated the answers on a 5-point Likert scale ranging from 'strongly disagree' to 'strongly agree'. The reliability of the responses provided by the LLMs was evaluated using the DISCERN and the Ensuring Quality Information for Patients (EQIP) scales. Readability metrics (Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index) were calculated to evaluate the complexity of the responses.
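The abstract does not state which software was used to compute these readability indices. As a rough illustration of what the three formulas measure, the following is a minimal Python sketch using the standard published formulas; the syllable counter is a crude vowel-group heuristic, so its output will only approximate the scores of dedicated readability tools.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups; real tools use dictionaries or better rules.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    n_letters = sum(len(w) for w in words)
    n_syllables = sum(count_syllables(w) for w in words)

    wps = n_words / sentences        # words per sentence
    spw = n_syllables / n_words      # syllables per word
    L = n_letters / n_words * 100    # letters per 100 words (Coleman-Liau)
    S = sentences / n_words * 100    # sentences per 100 words (Coleman-Liau)

    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "coleman_liau_index": 0.0588 * L - 0.296 * S - 15.8,
    }

# Example on a short, invented patient-information passage.
print(readability("Keratoconus is a progressive thinning of the cornea. "
                  "It can distort vision and may require specialist treatment."))
```

Lower Flesch Reading Ease scores and higher grade-level indices correspond to text that is harder to read, which is how the complexity of the LLM responses was compared.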

Results: ChatGPT-4 consistently scored 3 points or higher for all (100%) of its responses, while Copilot had five (10%) and Gemini had two (4%) responses scoring 2 points or below. ChatGPT-4 achieved a 'strongly agree' rate of 74% across all questions, markedly superior to Copilot at 34% and Gemini at 42% (p < 0.001), and recorded the highest 'strongly agree' rates in the general information and symptoms & diagnosis categories (90% for both). The median Likert scores differed among the LLMs (p < 0.001), with ChatGPT-4 scoring highest and Copilot lowest. Although ChatGPT-4 exhibited greater reliability on the DISCERN scale, its responses had lower readability and higher complexity. While all LLMs provided responses categorised as 'extremely difficult to read', the responses provided by Copilot showed higher readability.
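The abstract does not name the statistical test behind the comparison of median Likert scores; a nonparametric test such as Kruskal-Wallis is a common choice for ordinal ratings from more than two groups. The sketch below shows how such a comparison could be run with SciPy; the scores are invented placeholders for illustration only, not the study's data.

```python
from scipy.stats import kruskal

# Hypothetical per-question Likert scores (1-5) for three models; illustration only.
chatgpt4 = [5, 5, 4, 5, 5, 4, 5, 5, 3, 5]
copilot  = [3, 4, 3, 2, 4, 3, 5, 3, 4, 3]
gemini   = [4, 3, 4, 5, 3, 4, 4, 2, 5, 4]

# Kruskal-Wallis H-test: do the score distributions differ across the three groups?
stat, p = kruskal(chatgpt4, copilot, gemini)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.4f}")
```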

Conclusions: Although its responses exhibited lower readability and greater complexity, ChatGPT-4 emerged as the most proficient of the three LLMs in answering KCN-related questions.

Source journal metrics: CiteScore 4.10 · Self-citation rate 5.30% · Articles published 132 · Review time 6-12 weeks
Journal description: Clinical and Experimental Optometry is a peer-reviewed journal listed by ISI and abstracted by PubMed, Web of Science, Scopus, Science Citation Index and Current Contents. It publishes original research papers and reviews in clinical optometry and vision science. Debate and discussion of controversial scientific and clinical issues is encouraged, and letters to the Editor and short communications expressing points of view on matters within the Journal's areas of interest are welcome. The Journal is published six times annually.