{"title":"From accuracy to comprehensibility: Evaluating large language models for myopia patient queries","authors":"Ezgi Karataş , Ceren Durmaz Engin","doi":"10.1016/j.hlpt.2025.101073","DOIUrl":null,"url":null,"abstract":"<div><h3>Objectives</h3><div>This study evaluated the accuracy and comprehensibility of responses from three large language models (LLMs)—ChatGPT-4, Gemini, and Copilot—when addressing patient queries about myopia. Accurate, understandable information is crucial for effective patient education and management of this common refractive error.</div></div><div><h3>Methods</h3><div>Sixty questions across six categories (definition, etiology, symptoms and diagnosis, myopia control, correction, and new treatments) were presented to ChatGPT-4, Gemini, and Copilot. Responses were assessed for accuracy by two experienced ophthalmologists using a 3-point Likert scale. Quality and reliability were evaluated using the DISCERN and EQIP scales, while readability was measured with the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index. Statistical analyses were conducted using SPSS version 25.</div></div><div><h3>Results</h3><div>ChatGPT-4 provided the most accurate responses in the defsinition, symptoms, and diagnosis categories, with a 75 % overall success rate. Copilot had a similar success rate of 73.3 % but the highest inaccuracy rate (6.7 %). Gemini had a 71.7 % success rate. Copilot scored highest in reliability (DISCERN 76) and readability (Flesch Reading Ease 46.74), followed by ChatGPT-4 and Gemini. No significant differences in accuracy were found among the LLMs across categories.</div></div><div><h3>Conclusions</h3><div>All three LLMs performed well in providing myopia-related information. Copilot excelled in readability and reliability despite a higher inaccuracy rate. ChatGPT-4 and Copilot outperformed Gemini, likely due to their advanced architectures and training methodologies. These findings highlight the potential of LLMs in patient education and the need for ongoing improvements to ensure accurate, comprehensible AI-generated health information.</div></div>","PeriodicalId":48672,"journal":{"name":"Health Policy and Technology","volume":"14 6","pages":"Article 101073"},"PeriodicalIF":3.7000,"publicationDate":"2025-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Health Policy and Technology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2211883725001017","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH POLICY & SERVICES","Score":null,"Total":0}
Abstract
Objectives
This study evaluated the accuracy and comprehensibility of responses from three large language models (LLMs)—ChatGPT-4, Gemini, and Copilot—when addressing patient queries about myopia. Accurate, understandable information is crucial for effective patient education and management of this common refractive error.
Methods
Sixty questions across six categories (definition, etiology, symptoms and diagnosis, myopia control, correction, and new treatments) were presented to ChatGPT-4, Gemini, and Copilot. Responses were assessed for accuracy by two experienced ophthalmologists using a 3-point Likert scale. Quality and reliability were evaluated using the DISCERN and EQIP scales, while readability was measured with the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index. Statistical analyses were conducted using SPSS version 25.
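The three readability indices named above are closed-form formulas over sentence, word, syllable, and letter counts. The sketch below is not the study's actual tooling (which is not specified); it shows how the indices can be computed, with a rough vowel-group heuristic standing in for a proper syllable counter, so its scores only approximate those of dictionary-based tools.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count vowel groups, discounting a trailing silent "e".
    # Published readability tools use dictionaries, so this is approximate.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    letters = sum(len(w) for w in words)

    wps = n_words / sentences        # average words per sentence
    spw = syllables / n_words        # average syllables per word
    L = letters / n_words * 100      # letters per 100 words
    S = sentences / n_words * 100    # sentences per 100 words

    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "coleman_liau_index": 0.0588 * L - 0.296 * S - 15.8,
    }

if __name__ == "__main__":
    sample = "Myopia is a common refractive error. Distant objects appear blurry."
    print(readability(sample))
```

Higher Flesch Reading Ease means easier text, while the Flesch-Kincaid and Coleman-Liau values approximate the U.S. school grade needed to understand it, which is why the three scores move in opposite directions.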
Results
ChatGPT-4 provided the most accurate responses in the definition, symptoms, and diagnosis categories, with a 75 % overall success rate. Copilot had a similar success rate of 73.3 % but the highest inaccuracy rate (6.7 %). Gemini had a 71.7 % success rate. Copilot scored highest in reliability (DISCERN 76) and readability (Flesch Reading Ease 46.74), followed by ChatGPT-4 and Gemini. No significant differences in accuracy were found among the LLMs across categories.
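With 60 questions per model, the reported rates translate directly into question counts, and the between-model comparison can be illustrated with a chi-square test over the rating distributions. The tallies below are hypothetical reconstructions consistent with the reported percentages; the paper's raw counts and exact test are not given here, so the printed p-value is illustrative only.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts out of 60 questions per model, chosen only to
# match the reported rates (75 %, 73.3 %, 71.7 % accurate; Copilot
# 6.7 % inaccurate). Columns: accurate, partially accurate, inaccurate.
counts = [
    [45, 13, 2],   # ChatGPT-4: 45/60 = 75 % accurate
    [44, 12, 4],   # Copilot: 44/60 = 73.3 % accurate, 4/60 = 6.7 % inaccurate
    [43, 15, 2],   # Gemini: 43/60 = 71.7 % accurate
]

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```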
Conclusions
All three LLMs performed well in providing myopia-related information. Copilot excelled in readability and reliability despite a higher inaccuracy rate. ChatGPT-4 and Copilot outperformed Gemini, likely due to their advanced architectures and training methodologies. These findings highlight the potential of LLMs in patient education and the need for ongoing improvements to ensure accurate, comprehensible AI-generated health information.
Journal Introduction
Health Policy and Technology (HPT) is the official journal of the Fellowship of Postgraduate Medicine (FPM). It is a cross-disciplinary journal that focuses on past, present and future health policy and the role of technology in clinical and non-clinical national and international health environments.
HPT provides a further avenue for the FPM to make important national and international contributions to the development of policy and practice within medicine and related disciplines. The aim of HPT is to publish relevant, timely and accessible articles and commentaries to support policy-makers, health professionals, health technology providers, patient groups and academics interested in health policy and technology.
Topics covered by HPT will include:
- Health technology, including drug discovery, diagnostics, medicines, devices, therapeutic delivery and eHealth systems
- Cross-national comparisons on health policy using evidence-based approaches
- National studies on health policy to determine the outcomes of technology-driven initiatives
- Cross-border eHealth including health tourism
- The digital divide in mobility, access and affordability of healthcare
- Health technology assessment (HTA) methods and tools for evaluating the effectiveness of clinical and non-clinical health technologies
- Health and eHealth indicators and benchmarks (measure/metrics) for understanding the adoption and diffusion of health technologies
- Health and eHealth models and frameworks to support policy-makers and other stakeholders in decision-making
- Stakeholder engagement with health technologies (clinical and patient/citizen buy-in)
- Regulation and health economics