Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment

IF 3.7 · CAS Tier 2 (Medicine) · JCR Q2 (Computer Science, Information Systems)
Mine Büker, Gamze Mercan
{"title":"Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment","authors":"Mine Büker,&nbsp;Gamze Mercan","doi":"10.1016/j.ijmedinf.2025.105948","DOIUrl":null,"url":null,"abstract":"<div><h3>Aim</h3><div>This study aimed to assess the readability, accuracy, appropriateness, and overall quality of responses generated by large language models (LLMs), including ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash), to frequently asked questions (FAQs) related to root canal retreatment.</div></div><div><h3>Methods</h3><div>Three LLM chatbots—ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash)—were assessed based on their responses to 10 patient FAQs. Readability was analyzed using seven indices, including Flesch reading ease score (FRES), Flesch-Kincaid grade level (FKGL), Simple Measure of Gobbledygook (SMOG), gunning FOG (GFOG), Linsear Write (LW), Coleman-Liau (CL), and automated readability index (ARI), and compared against the recommended sixth-grade reading level. Response quality was evaluated using the Global Quality Scale (GQS), while accuracy and appropriateness were rated on a five-point Likert scale by two independent reviewers. Statistical analyses were conducted using one-way ANOVA, Tukey or Games-Howell post-hoc tests for continuous variables. Spearman’s correlation test was used to assess associations between categorical variables.</div></div><div><h3>Results</h3><div>All chatbots generated responses exceeding the recommended readability level, making them suitable for readers at or above the 10th-grade level. No significant difference was found between ChatGPT-3.5 and Microsoft Copilot, while Gemini produced significantly more readable responses (p &lt; 0.05). Gemini demonstrated the highest proportion of accurate (80 %) and high-quality responses (80 %) compared to ChatGPT-3.5 and Microsoft Copilot.</div></div><div><h3>Conclusions</h3><div>None of the chatbots met the recommended readability standards for patient education materials. While Gemini demonstrated better readability, accuracy, and quality, all three models require further optimization to enhance accessibility and reliability in patient communication.</div></div>","PeriodicalId":54950,"journal":{"name":"International Journal of Medical Informatics","volume":"201 ","pages":"Article 105948"},"PeriodicalIF":3.7000,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1386505625001650","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Aim

This study aimed to assess the readability, accuracy, appropriateness, and overall quality of responses generated by large language models (LLMs), including ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash), to frequently asked questions (FAQs) related to root canal retreatment.

Methods

Three LLM chatbots—ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash)—were assessed based on their responses to 10 patient FAQs. Readability was analyzed using seven indices: the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog (GFOG), Linsear Write (LW), Coleman-Liau (CL), and Automated Readability Index (ARI), each compared against the recommended sixth-grade reading level. Response quality was evaluated using the Global Quality Scale (GQS), while accuracy and appropriateness were rated on a five-point Likert scale by two independent reviewers. Statistical analyses of continuous variables used one-way ANOVA with Tukey or Games-Howell post-hoc tests; Spearman's correlation test was used to assess associations between categorical variables.
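The seven readability indices named above can be reproduced with standard tooling. The following minimal sketch assumes the third-party Python `textstat` package and uses a hypothetical chatbot response as input; it illustrates how each score, and the comparison against the sixth-grade target, might be computed, and is not the authors' analysis code.

```python
# Minimal sketch: computing the seven readability indices for one chatbot
# response. Assumes the third-party `textstat` package (pip install textstat);
# the sample text is a hypothetical stand-in for a real chatbot answer.
import textstat

response = (
    "Root canal retreatment removes the previous filling material, "
    "cleans the canals again, and seals them to resolve persistent infection."
)

scores = {
    "FRES": textstat.flesch_reading_ease(response),           # higher = easier
    "FKGL": textstat.flesch_kincaid_grade(response),          # US grade level
    "SMOG": textstat.smog_index(response),
    "GFOG": textstat.gunning_fog(response),
    "LW":   textstat.linsear_write_formula(response),
    "CL":   textstat.coleman_liau_index(response),
    "ARI":  textstat.automated_readability_index(response),
}

for index, value in scores.items():
    print(f"{index}: {value:.2f}")

# Grade-level indices (everything except FRES) can then be checked against
# the recommended sixth-grade reading level for patient education materials.
grade_indices = {k: v for k, v in scores.items() if k != "FRES"}
too_difficult = {k: v for k, v in grade_indices.items() if v > 6}
print("Indices above the 6th-grade target:", too_difficult)
```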

Results

All chatbots generated responses exceeding the recommended readability level, making them suitable only for readers at or above the 10th-grade level. No significant difference in readability was found between ChatGPT-3.5 and Microsoft Copilot, while Gemini produced significantly more readable responses (p < 0.05). Gemini demonstrated the highest proportion of accurate (80%) and high-quality (80%) responses compared with ChatGPT-3.5 and Microsoft Copilot.
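As a rough illustration of the between-chatbot comparison described in the Methods (one-way ANOVA with post-hoc testing, plus Spearman correlation for the ordinal ratings), the sketch below uses hypothetical per-question FKGL scores and ratings; the numbers are placeholders, not data from the study.

```python
# Illustrative sketch of the statistical comparison described in the Methods:
# one-way ANOVA on per-question readability scores across the three chatbots,
# followed by a Tukey post-hoc test. All values are hypothetical placeholders.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical FKGL scores for 10 FAQ responses per chatbot.
chatgpt = np.array([11.2, 12.0, 10.8, 11.5, 12.3, 11.9, 10.6, 11.1, 12.5, 11.7])
copilot = np.array([11.8, 12.4, 11.0, 12.1, 11.6, 12.0, 11.3, 12.2, 11.9, 11.5])
gemini  = np.array([ 9.1,  9.8,  8.7,  9.4, 10.0,  9.2,  8.9,  9.6,  9.3,  9.7])

# One-way ANOVA across the three groups.
f_stat, p_value = stats.f_oneway(chatgpt, copilot, gemini)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey HSD post-hoc test; Games-Howell would be substituted when group
# variances are unequal (e.g. via the pingouin package).
scores = np.concatenate([chatgpt, copilot, gemini])
groups = ["ChatGPT-3.5"] * 10 + ["Copilot"] * 10 + ["Gemini"] * 10
print(pairwise_tukeyhsd(scores, groups))

# Spearman correlation between two ordinal ratings (e.g. accuracy vs. quality),
# again on hypothetical values.
accuracy = [5, 4, 5, 3, 4, 5, 4, 5, 3, 4]
quality  = [5, 4, 4, 3, 4, 5, 4, 5, 3, 5]
rho, p_rho = stats.spearmanr(accuracy, quality)
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.4f}")
```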

Conclusions

None of the chatbots met the recommended readability standards for patient education materials. While Gemini demonstrated better readability, accuracy, and quality, all three models require further optimization to enhance accessibility and reliability in patient communication.
Source journal

International Journal of Medical Informatics (Medicine – Computer Science: Information Systems)
CiteScore: 8.90
Self-citation rate: 4.10%
Annual articles: 217
Review time: 42 days
Journal description: International Journal of Medical Informatics provides an international medium for the dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings. The scope of the journal covers: information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.; computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.; educational computer-based programs pertaining to medical informatics or medicine in general; and organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.