Evaluating Text-to-Image Generation in Pediatric Ophthalmology

Sarah Jong, Qais A Dihan, Mohamed M Khodeiry, Ahmad Alzein, Christina Scelfo, Abdelrahman M Elhusseiny

Journal of Pediatric Ophthalmology & Strabismus, published September 26, 2025. DOI: 10.3928/01913913-20250724-03
Abstract
Purpose: To evaluate the quality and accuracy of artificial intelligence (AI)-generated images depicting pediatric ophthalmology pathologies compared to human-illustrated images, and assess the readability, quality, and accuracy of accompanying AI-generated textual information.
Methods: This cross-sectional comparative study analyzed outputs from DALL·E 3 (OpenAI) and Gemini Advanced (Google). Nine pediatric ophthalmology pathologies were sourced from the American Association for Pediatric Ophthalmology and Strabismus (AAPOS) "Most Common Searches." Two prompts were used: Prompt A asked large language models (LLMs), "What is [insert pathology]?" Prompt B requested text-to-image generators (TTIs) to create images of the pathologies. Textual responses were evaluated for quality using published criteria (helpfulness, truthfulness, harmlessness; score 1 to 15, ≥ 12: high quality) and for readability using the Simple Measure of Gobbledygook (SMOG) and Flesch-Kincaid Grade Level (FKGL) (≤ 6th-grade level: readable). Images were assessed for anatomical accuracy, pathological accuracy, artifacts, and color (score 1 to 15, ≥ 12: high quality). Human-illustrated images served as controls.
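For readers unfamiliar with the two readability metrics, their standard published formulas can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual analysis pipeline: the syllable counter is a rough vowel-group heuristic (validated tools use pronunciation dictionaries), and the sample sentence is hypothetical, not taken from the study's data.

```python
import math
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count contiguous vowel groups (real tools use dictionaries such as CMUdict)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

def smog(text: str) -> float:
    """SMOG grade = 1.0430*sqrt(polysyllabic words * 30/sentences) + 3.1291.
    Note: SMOG was designed for samples of 30 sentences; short texts give unstable estimates."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291

# Illustrative sentence only (not from the study's LLM outputs).
sample = "Amblyopia is reduced vision in one eye caused by abnormal visual development."
print(f"FKGL: {fkgl(sample):.1f}  SMOG: {smog(sample):.1f}")
```

Both formulas map text statistics to a US school-grade level, which is why the study's threshold for "readable" patient-facing text is a 6th-grade score or below.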
Results: DALL·E 3 images were of poor quality (median: 7; range: 3 to 15) and significantly worse than human-illustrated controls (median: 15; range: 9 to 15; P < .001). Pathological accuracy was also poor (median: 1). Textual information from ChatGPT-4o and Gemini Advanced was high quality (median: 15) but difficult to read (ChatGPT-4o: SMOG: 8.2, FKGL: 8.9; Gemini Advanced: SMOG: 8.5, FKGL: 9.3).
Conclusions: Text-to-image generators are poor at generating images of common pediatric ophthalmology pathologies. They can serve as adequate supplemental tools for generating high-quality, accurate textual information, but the generated text must be tailored so that it is readable by its intended users.
Journal overview:
The Journal of Pediatric Ophthalmology & Strabismus is a bimonthly peer-reviewed publication for pediatric ophthalmologists. For over 50 years, the Journal has published original articles on the diagnosis, treatment, and prevention of eye disorders in the pediatric age group and on the treatment of strabismus in all age groups.