From accuracy to comprehensibility: Evaluating large language models for myopia patient queries

Impact Factor 3.7 · CAS Tier 3 (Medicine) · JCR Q1, Health Policy & Services
Ezgi Karataş , Ceren Durmaz Engin
{"title":"From accuracy to comprehensibility: Evaluating large language models for myopia patient queries","authors":"Ezgi Karataş ,&nbsp;Ceren Durmaz Engin","doi":"10.1016/j.hlpt.2025.101073","DOIUrl":null,"url":null,"abstract":"<div><h3>Objectives</h3><div>This study evaluated the accuracy and comprehensibility of responses from three large language models (LLMs)—ChatGPT-4, Gemini, and Copilot—when addressing patient queries about myopia. Accurate, understandable information is crucial for effective patient education and management of this common refractive error.</div></div><div><h3>Methods</h3><div>Sixty questions across six categories (definition, etiology, symptoms and diagnosis, myopia control, correction, and new treatments) were presented to ChatGPT-4, Gemini, and Copilot. Responses were assessed for accuracy by two experienced ophthalmologists using a 3-point Likert scale. Quality and reliability were evaluated using the DISCERN and EQIP scales, while readability was measured with the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index. Statistical analyses were conducted using SPSS version 25.</div></div><div><h3>Results</h3><div>ChatGPT-4 provided the most accurate responses in the defsinition, symptoms, and diagnosis categories, with a 75 % overall success rate. Copilot had a similar success rate of 73.3 % but the highest inaccuracy rate (6.7 %). Gemini had a 71.7 % success rate. Copilot scored highest in reliability (DISCERN 76) and readability (Flesch Reading Ease 46.74), followed by ChatGPT-4 and Gemini. No significant differences in accuracy were found among the LLMs across categories.</div></div><div><h3>Conclusions</h3><div>All three LLMs performed well in providing myopia-related information. Copilot excelled in readability and reliability despite a higher inaccuracy rate. ChatGPT-4 and Copilot outperformed Gemini, likely due to their advanced architectures and training methodologies. These findings highlight the potential of LLMs in patient education and the need for ongoing improvements to ensure accurate, comprehensible AI-generated health information.</div></div>","PeriodicalId":48672,"journal":{"name":"Health Policy and Technology","volume":"14 6","pages":"Article 101073"},"PeriodicalIF":3.7000,"publicationDate":"2025-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Health Policy and Technology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2211883725001017","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH POLICY & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Objectives

This study evaluated the accuracy and comprehensibility of responses from three large language models (LLMs)—ChatGPT-4, Gemini, and Copilot—when addressing patient queries about myopia. Accurate, understandable information is crucial for effective patient education and management of this common refractive error.

Methods

Sixty questions across six categories (definition, etiology, symptoms and diagnosis, myopia control, correction, and new treatments) were presented to ChatGPT-4, Gemini, and Copilot. Responses were assessed for accuracy by two experienced ophthalmologists using a 3-point Likert scale. Quality and reliability were evaluated using the DISCERN and EQIP scales, while readability was measured with the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index. Statistical analyses were conducted using SPSS version 25.
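The three readability indices named above are standard formulas computed from counts of sentences, words, letters, and syllables. The study does not report its tooling, so the following is only a minimal Python sketch of those formulas; the syllable counter is a crude vowel-group heuristic, so values will only approximate those of dedicated readability tools.

```python
import re

def count_syllables(word: str) -> int:
    # Rough approximation: count groups of consecutive vowels, minimum of one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    letters = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)

    n_sent, n_words = len(sentences), len(words)
    # Flesch Reading Ease: higher = easier to read.
    fre = 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (syllables / n_words)
    # Flesch-Kincaid Grade Level: approximate U.S. school grade.
    fkgl = 0.39 * (n_words / n_sent) + 11.8 * (syllables / n_words) - 15.59
    # Coleman-Liau Index: uses letters and sentences per 100 words.
    cli = 0.0588 * (letters / n_words * 100) - 0.296 * (n_sent / n_words * 100) - 15.8
    return {"flesch_reading_ease": fre, "fk_grade_level": fkgl, "coleman_liau": cli}

if __name__ == "__main__":
    sample = ("Myopia, or nearsightedness, is a refractive error in which distant "
              "objects appear blurred because light focuses in front of the retina.")
    print(readability(sample))
```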

Results

ChatGPT-4 provided the most accurate responses in the definition, symptoms, and diagnosis categories, with a 75 % overall success rate. Copilot had a similar success rate of 73.3 % but the highest inaccuracy rate (6.7 %). Gemini had a 71.7 % success rate. Copilot scored highest in reliability (DISCERN 76) and readability (Flesch Reading Ease 46.74), followed by ChatGPT-4 and Gemini. No significant differences in accuracy were found among the LLMs across categories.
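The abstract does not name the specific statistical test behind the "no significant differences" finding; for ordinal 3-point Likert accuracy ratings, a non-parametric comparison such as the Kruskal-Wallis test is one common choice. The sketch below is illustrative only, run on hypothetical placeholder ratings rather than the study's data, and shows how such a comparison could be performed outside SPSS.

```python
# Illustrative only: the study analysed ratings in SPSS; the score lists below
# are hypothetical placeholders (1 = inaccurate, 2 = partially accurate,
# 3 = accurate on the 3-point Likert scale), not the study's data.
from scipy.stats import kruskal

chatgpt4 = [3, 3, 2, 3, 2, 3, 3, 2, 3, 3]
gemini   = [3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
copilot  = [3, 3, 2, 3, 1, 3, 3, 2, 3, 3]

stat, p = kruskal(chatgpt4, gemini, copilot)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.3f}")
# A p-value above 0.05 would be consistent with the paper's finding of no
# significant accuracy differences among the three models.
```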

Conclusions

All three LLMs performed well in providing myopia-related information. Copilot excelled in readability and reliability despite a higher inaccuracy rate. ChatGPT-4 and Copilot outperformed Gemini, likely due to their advanced architectures and training methodologies. These findings highlight the potential of LLMs in patient education and the need for ongoing improvements to ensure accurate, comprehensible AI-generated health information.
Source journal: Health Policy and Technology (Medicine - Health Policy)
CiteScore: 9.20
Self-citation rate: 3.30%
Articles published: 78
Review time: 88 days
Journal description: Health Policy and Technology (HPT) is the official journal of the Fellowship of Postgraduate Medicine (FPM), a cross-disciplinary journal which focuses on past, present and future health policy and the role of technology in clinical and non-clinical, national and international health environments. HPT provides a further excellent way for the FPM to continue to make important national and international contributions to the development of policy and practice within medicine and related disciplines. The aim of HPT is to publish relevant, timely and accessible articles and commentaries to support policy-makers, health professionals, health technology providers, patient groups and academia interested in health policy and technology. Topics covered by HPT include:
- Health technology, including drug discovery, diagnostics, medicines, devices, therapeutic delivery and eHealth systems
- Cross-national comparisons on health policy using evidence-based approaches
- National studies on health policy to determine the outcomes of technology-driven initiatives
- Cross-border eHealth including health tourism
- The digital divide in mobility, access and affordability of healthcare
- Health technology assessment (HTA) methods and tools for evaluating the effectiveness of clinical and non-clinical health technologies
- Health and eHealth indicators and benchmarks (measures/metrics) for understanding the adoption and diffusion of health technologies
- Health and eHealth models and frameworks to support policy-makers and other stakeholders in decision-making
- Stakeholder engagement with health technologies (clinical and patient/citizen buy-in)
- Regulation and health economics