{"title":"人工智能聊天机器人响应作为患者根管再治疗信息源的可读性、准确性、适宜性和质量:比较评估","authors":"Mine Büker, Gamze Mercan","doi":"10.1016/j.ijmedinf.2025.105948","DOIUrl":null,"url":null,"abstract":"<div><h3>Aim</h3><div>This study aimed to assess the readability, accuracy, appropriateness, and overall quality of responses generated by large language models (LLMs), including ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash), to frequently asked questions (FAQs) related to root canal retreatment.</div></div><div><h3>Methods</h3><div>Three LLM chatbots—ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash)—were assessed based on their responses to 10 patient FAQs. Readability was analyzed using seven indices, including Flesch reading ease score (FRES), Flesch-Kincaid grade level (FKGL), Simple Measure of Gobbledygook (SMOG), gunning FOG (GFOG), Linsear Write (LW), Coleman-Liau (CL), and automated readability index (ARI), and compared against the recommended sixth-grade reading level. Response quality was evaluated using the Global Quality Scale (GQS), while accuracy and appropriateness were rated on a five-point Likert scale by two independent reviewers. Statistical analyses were conducted using one-way ANOVA, Tukey or Games-Howell post-hoc tests for continuous variables. Spearman’s correlation test was used to assess associations between categorical variables.</div></div><div><h3>Results</h3><div>All chatbots generated responses exceeding the recommended readability level, making them suitable for readers at or above the 10th-grade level. No significant difference was found between ChatGPT-3.5 and Microsoft Copilot, while Gemini produced significantly more readable responses (p < 0.05). Gemini demonstrated the highest proportion of accurate (80 %) and high-quality responses (80 %) compared to ChatGPT-3.5 and Microsoft Copilot.</div></div><div><h3>Conclusions</h3><div>None of the chatbots met the recommended readability standards for patient education materials. While Gemini demonstrated better readability, accuracy, and quality, all three models require further optimization to enhance accessibility and reliability in patient communication.</div></div>","PeriodicalId":54950,"journal":{"name":"International Journal of Medical Informatics","volume":"201 ","pages":"Article 105948"},"PeriodicalIF":3.7000,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment\",\"authors\":\"Mine Büker, Gamze Mercan\",\"doi\":\"10.1016/j.ijmedinf.2025.105948\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Aim</h3><div>This study aimed to assess the readability, accuracy, appropriateness, and overall quality of responses generated by large language models (LLMs), including ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash), to frequently asked questions (FAQs) related to root canal retreatment.</div></div><div><h3>Methods</h3><div>Three LLM chatbots—ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash)—were assessed based on their responses to 10 patient FAQs. 
Readability was analyzed using seven indices, including Flesch reading ease score (FRES), Flesch-Kincaid grade level (FKGL), Simple Measure of Gobbledygook (SMOG), gunning FOG (GFOG), Linsear Write (LW), Coleman-Liau (CL), and automated readability index (ARI), and compared against the recommended sixth-grade reading level. Response quality was evaluated using the Global Quality Scale (GQS), while accuracy and appropriateness were rated on a five-point Likert scale by two independent reviewers. Statistical analyses were conducted using one-way ANOVA, Tukey or Games-Howell post-hoc tests for continuous variables. Spearman’s correlation test was used to assess associations between categorical variables.</div></div><div><h3>Results</h3><div>All chatbots generated responses exceeding the recommended readability level, making them suitable for readers at or above the 10th-grade level. No significant difference was found between ChatGPT-3.5 and Microsoft Copilot, while Gemini produced significantly more readable responses (p < 0.05). Gemini demonstrated the highest proportion of accurate (80 %) and high-quality responses (80 %) compared to ChatGPT-3.5 and Microsoft Copilot.</div></div><div><h3>Conclusions</h3><div>None of the chatbots met the recommended readability standards for patient education materials. While Gemini demonstrated better readability, accuracy, and quality, all three models require further optimization to enhance accessibility and reliability in patient communication.</div></div>\",\"PeriodicalId\":54950,\"journal\":{\"name\":\"International Journal of Medical Informatics\",\"volume\":\"201 \",\"pages\":\"Article 105948\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2025-04-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Medical Informatics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1386505625001650\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1386505625001650","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Readability, accuracy, appropriateness, and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment
Aim
This study aimed to assess the readability, accuracy, appropriateness, and overall quality of responses generated by large language models (LLMs), including ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash), to frequently asked questions (FAQs) related to root canal retreatment.
Methods
Three LLM chatbots (ChatGPT-3.5, Microsoft Copilot, and Gemini Version 2.0 Flash) were assessed based on their responses to 10 patient FAQs. Readability was analyzed using seven indices: Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog (GFOG), Linsear Write (LW), Coleman-Liau (CL), and Automated Readability Index (ARI), and compared against the recommended sixth-grade reading level. Response quality was evaluated using the Global Quality Scale (GQS), while accuracy and appropriateness were rated on a five-point Likert scale by two independent reviewers. Statistical analyses of continuous variables were conducted using one-way ANOVA with Tukey or Games-Howell post-hoc tests. Spearman's correlation test was used to assess associations between categorical variables.
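As an illustrative sketch only (not the authors' actual analysis pipeline), the seven readability indices named above can be computed with the open-source textstat Python package; the response text used here is a hypothetical placeholder, not study data.

```python
# Sketch: computing the seven readability indices listed in the Methods
# using the `textstat` package (pip install textstat).
# The response text below is a hypothetical placeholder, not study data.
import textstat

response = (
    "Root canal retreatment removes the previous filling material, "
    "cleans the canals again, and seals them to treat persistent infection."
)

indices = {
    "FRES": textstat.flesch_reading_ease(response),
    "FKGL": textstat.flesch_kincaid_grade(response),
    "SMOG": textstat.smog_index(response),
    "GFOG": textstat.gunning_fog(response),
    "LW": textstat.linsear_write_formula(response),
    "CL": textstat.coleman_liau_index(response),
    "ARI": textstat.automated_readability_index(response),
}

for name, value in indices.items():
    print(f"{name}: {value:.1f}")

# The grade-level indices (FKGL, SMOG, GFOG, LW, CL, ARI) can then be
# compared against the recommended sixth-grade reading level for
# patient education materials.
```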
Results
All chatbots generated responses that exceeded the recommended readability level, requiring a reading ability at or above the 10th-grade level. No significant difference in readability was found between ChatGPT-3.5 and Microsoft Copilot, while Gemini produced significantly more readable responses (p < 0.05). Gemini demonstrated the highest proportion of accurate (80%) and high-quality (80%) responses compared with ChatGPT-3.5 and Microsoft Copilot.
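A minimal sketch of how such a group comparison could be reproduced, assuming per-chatbot readability scores are available as arrays: SciPy provides the one-way ANOVA and Spearman correlation, and statsmodels the Tukey post-hoc test (a Games-Howell test, used when variances are unequal, is available in third-party packages such as pingouin). All score values below are hypothetical.

```python
# Sketch: group comparison of readability scores across the three chatbots.
# Values are hypothetical placeholders, not data from the study.
import numpy as np
from scipy.stats import f_oneway, spearmanr
from statsmodels.stats.multicomp import pairwise_tukeyhsd

chatgpt = np.array([12.1, 13.4, 11.8, 12.9, 13.0])  # e.g. FKGL per response
copilot = np.array([12.6, 13.1, 12.4, 13.5, 12.8])
gemini = np.array([10.2, 10.9, 11.1, 10.5, 10.8])

# One-way ANOVA across the three groups.
f_stat, p_value = f_oneway(chatgpt, copilot, gemini)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey HSD post-hoc comparisons (Games-Howell would replace this step
# when the equal-variance assumption does not hold).
scores = np.concatenate([chatgpt, copilot, gemini])
groups = (["ChatGPT-3.5"] * len(chatgpt)
          + ["Copilot"] * len(copilot)
          + ["Gemini"] * len(gemini))
print(pairwise_tukeyhsd(scores, groups))

# Spearman correlation between ordinal ratings, e.g. accuracy (Likert)
# and quality (GQS) scores for the same responses (hypothetical values).
accuracy = [5, 4, 5, 3, 4]
quality = [5, 4, 4, 3, 5]
rho, p_rho = spearmanr(accuracy, quality)
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.4f}")
```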
Conclusions
None of the chatbots met the recommended readability standards for patient education materials. While Gemini demonstrated better readability, accuracy, and quality, all three models require further optimization to enhance accessibility and reliability in patient communication.
About the journal:
International Journal of Medical Informatics provides an international medium for the dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods, as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical-impact, ethical, and cost-benefit aspects of IT applications in health care.