Supriya Dadi, Taylor Kring, Kyle Latz, David Cohen, Seth Thaller
Evaluating the Reliability and Readability of AI Chatbot Responses for Microtia Patient Education.
DOI: 10.1097/SCS.0000000000011988
Journal of Craniofacial Surgery (Q3, Surgery; Impact Factor 1.0)
Publication date: 2025-10-02
Citations: 0
Abstract
Introduction: Microtia is a congenital deformity of the external ear that can range from mild underdevelopment to complete absence of the ear. Often unilateral, it causes visible facial asymmetry that leads to psychosocial distress for patients and families. Caregivers report feeling guilty and anxious, while patients experience increased rates of depression and social challenges. During this difficult period, patients and their families often turn to AI chatbots for guidance before and after receiving definitive surgical care. This study evaluates the quality and readability of leading AI-based chatbots when responding to patient-centered questions about the condition.
Methods: Popular AI chatbots (ChatGPT 4o, Google Gemini, DeepSeek, and OpenEvidence) were each asked 25 queries about microtia, developed from the FAQ sections of hospital websites. Responses were evaluated using modified DISCERN criteria for quality and SMOG scoring for readability. ANOVA and post hoc analyses were performed to identify significant differences.
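The study does not reproduce its scoring implementation, but the SMOG readability metric it applies follows a standard published formula based on the number of polysyllabic words (3+ syllables) per 30 sentences. A minimal illustrative sketch, assuming word and sentence counts have already been obtained (the `smog_grade` helper name is hypothetical, not from the study):

```python
import math

def smog_grade(polysyllable_count: int, sentence_count: int) -> float:
    """Return the SMOG grade: approximate years of education needed
    to understand a text. Standard formula:
    grade = 3.1291 + 1.0430 * sqrt(polysyllables * (30 / sentences))
    """
    if sentence_count <= 0:
        raise ValueError("sentence_count must be positive")
    return 3.1291 + 1.0430 * math.sqrt(
        polysyllable_count * (30 / sentence_count)
    )

# Example: a 30-sentence chatbot response with 25 polysyllabic words
print(round(smog_grade(25, 30), 2))  # -> 8.34, roughly an 8th-grade level
```

Note that scores near 16-18, as reported for ChatGPT and OpenEvidence in the Results below, correspond to college- and graduate-level text, well above the 6th-to-8th-grade level typically recommended for patient education materials.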
Results: Google Gemini achieved the highest DISCERN score (M=37.16, SD=2.58), followed by OpenEvidence (M=32.19, SD=3.54). DeepSeek (M=30.76, SD=4.29) and ChatGPT (M=30.32, SD=2.97) had the lowest DISCERN scores. OpenEvidence had the worst readability (M=18.06, SD=1.12), followed by ChatGPT (M=16.32, SD=1.41). DeepSeek was the most readable (M=14.63, SD=1.60), closely followed by Google Gemini (M=14.73, SD=1.27). Overall, the average DISCERN and SMOG scores across all platforms were 32.19 (SD=4.43) and 15.93 (SD=1.94), respectively, indicating good quality and an undergraduate reading level.
Conclusions: None of the platforms consistently met both quality and readability standards, though Google Gemini performed relatively well. As reliance on AI for early health information grows, ensuring the accessibility of chatbot responses will be crucial for supporting informed decision-making and enhancing the patient experience.
About the journal
The Journal of Craniofacial Surgery serves as a forum of communication for all those involved in craniofacial surgery, maxillofacial surgery and pediatric plastic surgery. Coverage ranges from practical aspects of craniofacial surgery to the basic science that underlies surgical practice. The journal publishes original articles, scientific reviews, editorials and invited commentary, abstracts and selected articles from international journals, and occasional international bibliographies in craniofacial surgery.