Eyes on the Text: Assessing Readability of AI & Ophthalmologist Responses to Patient Surgery Queries

Sai S Kurapati, Derek J Barnett, Antonio Yaghy, Cameron J Sabet, David N Younessi, Dang Nguyen, John C Lin, Ingrid U Scott

Ophthalmologica, published online 2025-03-10, pp. 1-18. DOI: 10.1159/000544917
Abstract
Introduction: Generative artificial intelligence (AI) technologies like GPT-4 can instantaneously provide health information to patients; however, the readability of these outputs compared to ophthalmologist-written responses is unknown. This study aims to evaluate the readability of GPT-4-generated and ophthalmologist-written responses to patient queries about ophthalmic surgery.
Methods: This retrospective cross-sectional study used 200 randomly selected patient questions about ophthalmic surgery extracted from the American Academy of Ophthalmology's EyeSmart platform. The questions were entered into GPT-4, and the generated responses were recorded. Ophthalmologist-written replies to the same questions were compiled for comparison. Readability of the GPT-4 and ophthalmologist responses was assessed using six validated metrics: Flesch-Kincaid Reading Ease (FK-RE), Flesch-Kincaid Grade Level (FK-GL), Gunning Fog Score (GFS), SMOG Index (SI), Coleman-Liau Index (CLI), and Automated Readability Index (ARI). Descriptive statistics, one-way ANOVA, Shapiro-Wilk, and Levene's tests (α=0.05) were used to compare readability between the two groups.
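The abstract does not name the software used to compute these metrics or run the statistical tests; the sketch below is a minimal illustration of this kind of analysis pipeline, assuming the Python textstat and scipy libraries and small hypothetical placeholder lists (gpt4_responses, ophtho_responses) in place of the study's 200 paired responses.

```python
# Minimal sketch of the readability comparison described above.
# Assumes the `textstat` and `scipy` packages; the response lists below are
# hypothetical placeholders, not data from the study.
import textstat
from scipy import stats

gpt4_responses = [
    "Phacoemulsification utilizes ultrasonic energy to fragment the crystalline lens.",
    "Postoperative endophthalmitis necessitates prompt intravitreal antibiotic administration.",
    "Refractive outcomes depend on accurate preoperative biometry and lens calculations.",
]
ophtho_responses = [
    "Cataract surgery uses sound waves to break up the cloudy lens.",
    "An infection after surgery needs quick treatment with antibiotic shots.",
    "Good vision after surgery depends on careful eye measurements beforehand.",
]

# The six validated readability metrics named in the Methods.
METRICS = {
    "FK-RE": textstat.flesch_reading_ease,
    "FK-GL": textstat.flesch_kincaid_grade,
    "GFS": textstat.gunning_fog,
    "SI": textstat.smog_index,
    "CLI": textstat.coleman_liau_index,
    "ARI": textstat.automated_readability_index,
}

for name, metric in METRICS.items():
    gpt4_scores = [metric(t) for t in gpt4_responses]
    ophtho_scores = [metric(t) for t in ophtho_responses]

    # Check assumptions (normality, equal variances), then compare group means.
    _, p_shapiro = stats.shapiro(gpt4_scores + ophtho_scores)
    _, p_levene = stats.levene(gpt4_scores, ophtho_scores)
    f_stat, p_anova = stats.f_oneway(gpt4_scores, ophtho_scores)  # one-way ANOVA, alpha = 0.05

    print(f"{name}: GPT-4 mean = {sum(gpt4_scores)/len(gpt4_scores):.2f}, "
          f"ophthalmologist mean = {sum(ophtho_scores)/len(ophtho_scores):.2f}, "
          f"ANOVA p = {p_anova:.3f}")
```

In practice the placeholder lists would be replaced with the 200 GPT-4 outputs and the matched ophthalmologist replies, and per-sentence word counts and complex-word percentages would be tabulated alongside the six metric scores.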
Results: GPT-4 used a higher percentage of complex words (24.42%) than ophthalmologists (17.76%), although the mean [SD] word count per sentence was similar (18.43 [2.95] vs 18.01 [6.09]). Across all six metrics, GPT-4 responses were written at a higher grade level than ophthalmologists' responses: FK-RE 34.39 [8.51] vs 50.61 [15.53] (lower reading-ease scores indicate more difficult text); FK-GL 13.19 [2.63] vs 10.71 [2.99]; GFS 16.37 [2.04] vs 14.13 [3.55]; SI 12.18 [1.43] vs 10.07 [2.46]; CLI 15.72 [1.40] vs 12.64 [2.93]; and ARI 12.99 [1.86] vs 10.40 [3.61]. Responses from both sources required a 12th-grade education for comprehension. ANOVA showed significant differences (p<0.05) for all comparisons except word count per sentence (p=0.438).
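For context, the two Flesch metrics reported above are computed from average sentence length and syllables per word; the abstract does not restate the formulas, but the standard definitions are:

\[
\text{FK-RE} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}
\]
\[
\text{FK-GL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
\]

On the conventional FK-RE scale, scores of 30-50 correspond to "difficult" (roughly college-level) text and 50-60 to "fairly difficult" (roughly 10th- to 12th-grade) text, which is consistent with the grade-level indices reported above.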
Conclusions: The National Institutes of Health advises that health information be written at a sixth- to seventh-grade level. Both GPT-4- and ophthalmologist-written answers exceeded this recommendation, with GPT-4 showing the greater gap. Information accessibility is vital when designing patient resources, particularly with the rise of AI as an educational tool.
About the journal:
Published since 1899, Ophthalmologica has become a frequently cited guide to international work in clinical and experimental ophthalmology. It contains a selection of patient-oriented contributions covering the etiology of eye diseases, diagnostic techniques, and advances in medical and surgical treatment. Straightforward, factual reporting provides both interesting and useful reading. In addition to original papers, Ophthalmologica regularly features timely reviews to keep the reader well informed and up to date. The journal's large international circulation reflects its importance.