Niloufar Bineshfar, Chloe Shields, Natalia Davila, Sugi Panneerselvam, Tejus Pradeep, Marissa K Shoji, Wendy W Lee
{"title":"Evaluating large language models in answering patient questions about eye removal surgeries.","authors":"Niloufar Bineshfar, Chloe Shields, Natalia Davila, Sugi Panneerselvam, Tejus Pradeep, Marissa K Shoji, Wendy W Lee","doi":"10.1080/01676830.2025.2559735","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To evaluate the performance of ChatGPT-4 and Gemini, two large language models (LLMs), in addressing frequently asked questions (FAQs) about eye removal surgeries.</p><p><strong>Methods: </strong>A set of 24 FAQs related to enucleation and evisceration was identified through a Google search and categorized into preoperative, procedural, and postoperative topics. Each question was submitted three times to ChatGPT-4o and Gemini, and responses were evaluated for consistency, accuracy, appropriateness, and potential harm. Readability was assessed using Flesch Reading Ease and Flesch-Kincaid Grade Level scores.</p><p><strong>Results: </strong>Gemini exhibited higher response consistency compared to ChatGPT (<i>p</i> = 0.043), while ChatGPT produced longer responses (mean length: 169.3 vs. 109.9 words; <i>p</i> < 0.001). Gemini's responses were more readable, with a higher Flesch Reading Ease score (39.0 vs. 31.3, <i>p</i> = 0.001) and lower Flesch-Kincaid Grade Level (11.6 vs. 14.0, <i>p</i> < 0.001). Both LLMs demonstrated comparable accuracy and low potential for harm, with 79.2% of Gemini responses and 77.1% of ChatGPT responses rated as completely correct. The sources cited by Gemini included academic institutions (91.7%) and medical practices (8.3%), while ChatGPT exclusively referenced academic sources.</p><p><strong>Conclusions: </strong>ChatGPT and Gemini showed comparable accuracy and low harm potential when addressing patient questions about eye removal surgeries. Gemini provided more consistent and readable responses, but both LLMs exceeded the recommended readability levels for patient education. These findings highlight the potential of LLMs to assist in patient communication and clinical education while underscoring the need for careful oversight in their implementation.</p>","PeriodicalId":47421,"journal":{"name":"Orbit-The International Journal on Orbital Disorders-Oculoplastic and Lacrimal Surgery","volume":" ","pages":"1-8"},"PeriodicalIF":0.8000,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Orbit-The International Journal on Orbital Disorders-Oculoplastic and Lacrimal Surgery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/01676830.2025.2559735","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
Purpose: To evaluate the performance of ChatGPT-4o and Gemini, two large language models (LLMs), in addressing frequently asked questions (FAQs) about eye removal surgeries.
Methods: A set of 24 FAQs related to enucleation and evisceration was identified through a Google search and categorized into preoperative, procedural, and postoperative topics. Each question was submitted three times to ChatGPT-4o and Gemini, and responses were evaluated for consistency, accuracy, appropriateness, and potential harm. Readability was assessed using Flesch Reading Ease and Flesch-Kincaid Grade Level scores.
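For readers unfamiliar with the two readability metrics named above, the sketch below computes them from their standard published formulas. The naive regex-based syllable counter and the sample sentence are illustrative assumptions; the abstract does not describe the tooling the authors actually used.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    asl = n_words / sentences   # average sentence length (words per sentence)
    asw = syllables / n_words   # average syllables per word
    # Standard published formulas; higher FRE = easier, higher FKGL = harder.
    fre = 206.835 - 1.015 * asl - 84.6 * asw
    fkgl = 0.39 * asl + 11.8 * asw - 15.59
    return fre, fkgl

# Hypothetical sample input, not taken from the study:
fre, fkgl = readability(
    "Enucleation removes the entire eye. "
    "Evisceration removes the contents of the eye but leaves the sclera."
)
print(f"FRE = {fre:.1f}, FKGL = {fkgl:.1f}")
```

On these scales, the Gemini scores reported below (FRE 39.0, FKGL 11.6) correspond to roughly college-entry reading difficulty, well above the sixth-to-eighth-grade level commonly recommended for patient education materials.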
Results: Gemini exhibited higher response consistency compared to ChatGPT (p = 0.043), while ChatGPT produced longer responses (mean length: 169.3 vs. 109.9 words; p < 0.001). Gemini's responses were more readable, with a higher Flesch Reading Ease score (39.0 vs. 31.3, p = 0.001) and lower Flesch-Kincaid Grade Level (11.6 vs. 14.0, p < 0.001). Both LLMs demonstrated comparable accuracy and low potential for harm, with 79.2% of Gemini responses and 77.1% of ChatGPT responses rated as completely correct. The sources cited by Gemini included academic institutions (91.7%) and medical practices (8.3%), while ChatGPT exclusively referenced academic sources.
Conclusions: ChatGPT and Gemini showed comparable accuracy and low harm potential when addressing patient questions about eye removal surgeries. Gemini provided more consistent and readable responses, but both LLMs exceeded the recommended readability levels for patient education. These findings highlight the potential of LLMs to assist in patient communication and clinical education while underscoring the need for careful oversight in their implementation.
Journal description:
Orbit is the international medium covering developments and results from the variety of medical disciplines that overlap and converge in the field of orbital disorders: ophthalmology, otolaryngology, reconstructive and maxillofacial surgery, medicine and endocrinology, radiology, radiotherapy and oncology, neurology, neuro-ophthalmology and neurosurgery, pathology and immunology, and haematology.