Title: Evaluating the Reliability and Quality of Sarcoidosis-Related Information Provided by AI Chatbots
Authors: Nur Aleyna Yetkin, Burcu Baran, Bilal Rabahoğlu, Nuri Tutar, İnci Gülmez
Journal: Healthcare, vol. 13, no. 11 (JCR Q2, Health Care Sciences & Services; Impact Factor 2.4)
DOI: 10.3390/healthcare13111344
Published: 2025-06-05 (Journal Article)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12154112/pdf/
Citations: 0
Abstract
Background and Objectives: Artificial intelligence (AI) chatbots are increasingly used to disseminate health information, yet concerns about their accuracy and reliability remain. The complexity of sarcoidosis may lead to misinformation and omissions that affect patient comprehension. This study assessed the usability of AI-generated information on sarcoidosis by evaluating the quality, reliability, readability, understandability, and actionability of chatbot responses to patient-centered queries.

Methods: This cross-sectional evaluation included 11 AI chatbots, comprising both general-purpose and retrieval-augmented tools. Four sarcoidosis-related queries derived from Google Trends were submitted to each chatbot under standardized conditions. Responses were independently evaluated by four blinded pulmonology experts using DISCERN, the Patient Education Materials Assessment Tool-Printable (PEMAT-P), and Flesch-Kincaid readability metrics. A Web Resource Rating (WRR) score was also calculated. Inter-rater reliability was assessed using intraclass correlation coefficients (ICCs).

Results: Retrieval-augmented models such as ChatGPT-4o Deep Research, Perplexity Research, and Grok3 Deep Search outperformed general-purpose chatbots across the DISCERN, PEMAT-P, and WRR metrics. However, these high-performing models also produced text at significantly higher reading levels (Flesch-Kincaid Grade Level > 16), reducing accessibility. Actionability scores were consistently lower than understandability scores across all models. The ICCs exceeded 0.80 for all evaluation domains, indicating excellent inter-rater reliability.

Conclusions: Although some AI chatbots can generate accurate and well-structured responses to sarcoidosis-related questions, their limited readability and low actionability present barriers to effective patient education. Optimization strategies, such as prompt refinement, health literacy adaptation, and domain-specific model development, are required to improve the utility of AI chatbots in complex disease communication.
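For readers unfamiliar with the readability metric cited above: the Flesch-Kincaid Grade Level maps average sentence length and average syllables per word onto a U.S. school-grade scale, which is how the study can report scores such as "> 16" (college-graduate level). The sketch below, not part of the study, applies the standard published formula; the syllable counter is a crude vowel-group heuristic for illustration, whereas real readability tools use dictionaries or more careful rules.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels, then discount a
    # trailing silent 'e'. Adequate for a demo, not for published scoring.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def fk_grade_level(text: str) -> float:
    # Standard Flesch-Kincaid Grade Level formula:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

A short sentence of one-syllable words scores near (or below) grade 0, while dense polysyllabic prose of the kind the retrieval-augmented models produced scores far above grade 16, which is why the study flags it as an accessibility barrier.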
Journal description:
Healthcare (ISSN 2227-9032) is an international, peer-reviewed, open access journal (free for readers), which publishes original theoretical and empirical work in the interdisciplinary area of all aspects of medicine and health care research. Healthcare publishes Original Research Articles, Reviews, Case Reports, Research Notes and Short Communications. We encourage researchers to publish their experimental and theoretical results in as much detail as possible. For theoretical papers, full details of proofs must be provided so that the results can be checked; for experimental papers, full experimental details must be provided so that the results can be reproduced. Additionally, electronic files or software regarding the full details of the calculations, experimental procedure, etc., can be deposited along with the publication as “Supplementary Material”.