Benchmarking AI Chatbots for Maternal Lactation Support: A Cross-Platform Evaluation of Quality, Readability, and Clinical Accuracy

IF 2.7 | CAS Tier 4 (Medicine) | JCR Q2 HEALTH CARE SCIENCES & SERVICES
İlke Özer Aslan, Mustafa Törehan Aslan
{"title":"人工智能聊天机器人对产妇哺乳支持的基准:质量,可读性和临床准确性的跨平台评估。","authors":"İlke Özer Aslan, Mustafa Törehan Aslan","doi":"10.3390/healthcare13141756","DOIUrl":null,"url":null,"abstract":"<p><p><b>Background and Objective:</b> Large language model (LLM)-based chatbots are increasingly utilized by postpartum individuals seeking guidance on breastfeeding. However, the chatbots' content quality, readability, and alignment with clinical guidelines remain uncertain. This study was conducted to evaluate and compare the quality, readability, and factual accuracy of responses generated by three publicly accessible AI chatbots-ChatGPT-4o Pro, Gemini 2.5 Pro, and Copilot Pro-when prompted with common maternal questions related to breast-milk supply. <b>Methods:</b> Twenty frequently asked breastfeeding-related questions were submitted to each chatbot in separate sessions. The responses were paraphrased to enable standardized scoring and were then evaluated using three validated tools: ensuring quality information for patients (EQIP), the simple measure of gobbledygook (SMOG), and the global quality scale (GQS). Factual accuracy was benchmarked against WHO, ACOG, CDC, and NICE guidelines using a three-point rubric. Additional user experience metrics included response time, character count, content density, and structural formatting. Statistical comparisons were performed using the Kruskal-Wallis and Wilcoxon rank-sum tests with Bonferroni correction. <b>Results:</b> ChatGPT-4o Pro achieved the highest overall performance across all primary outcomes: EQIP score (85.7 ± 2.4%), SMOG score (9.78 ± 0.22), and GQS rating (4.55 ± 0.50), followed by Gemini 2.5 Pro and Copilot Pro (<i>p</i> < 0.001 for all comparisons). ChatGPT-4o Pro also demonstrated the highest factual alignment with clinical guidelines (95%), while Copilot showed more frequent omissions or simplifications. Differences in response time and formatting quality were statistically significant, although not always clinically meaningful. <b>Conclusions:</b> ChatGPT-4o Pro outperforms other chatbots in delivering structured, readable, and guideline-concordant breastfeeding information. However, substantial variability persists across the platforms, and none should be considered a substitute for professional guidance. Importantly, the phenomenon of AI hallucinations-where chatbots may generate factually incorrect or fabricated information-remains a critical risk that must be addressed to ensure safe integration into maternal health communication. Future efforts should focus on improving the transparency, accuracy, and multilingual reliability of AI chatbots to ensure their safe integration into maternal health communications.</p>","PeriodicalId":12977,"journal":{"name":"Healthcare","volume":"13 14","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2025-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Benchmarking AI Chatbots for Maternal Lactation Support: A Cross-Platform Evaluation of Quality, Readability, and Clinical Accuracy.\",\"authors\":\"İlke Özer Aslan, Mustafa Törehan Aslan\",\"doi\":\"10.3390/healthcare13141756\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>Background and Objective:</b> Large language model (LLM)-based chatbots are increasingly utilized by postpartum individuals seeking guidance on breastfeeding. However, the chatbots' content quality, readability, and alignment with clinical guidelines remain uncertain. 
This study was conducted to evaluate and compare the quality, readability, and factual accuracy of responses generated by three publicly accessible AI chatbots-ChatGPT-4o Pro, Gemini 2.5 Pro, and Copilot Pro-when prompted with common maternal questions related to breast-milk supply. <b>Methods:</b> Twenty frequently asked breastfeeding-related questions were submitted to each chatbot in separate sessions. The responses were paraphrased to enable standardized scoring and were then evaluated using three validated tools: ensuring quality information for patients (EQIP), the simple measure of gobbledygook (SMOG), and the global quality scale (GQS). Factual accuracy was benchmarked against WHO, ACOG, CDC, and NICE guidelines using a three-point rubric. Additional user experience metrics included response time, character count, content density, and structural formatting. Statistical comparisons were performed using the Kruskal-Wallis and Wilcoxon rank-sum tests with Bonferroni correction. <b>Results:</b> ChatGPT-4o Pro achieved the highest overall performance across all primary outcomes: EQIP score (85.7 ± 2.4%), SMOG score (9.78 ± 0.22), and GQS rating (4.55 ± 0.50), followed by Gemini 2.5 Pro and Copilot Pro (<i>p</i> < 0.001 for all comparisons). ChatGPT-4o Pro also demonstrated the highest factual alignment with clinical guidelines (95%), while Copilot showed more frequent omissions or simplifications. Differences in response time and formatting quality were statistically significant, although not always clinically meaningful. <b>Conclusions:</b> ChatGPT-4o Pro outperforms other chatbots in delivering structured, readable, and guideline-concordant breastfeeding information. However, substantial variability persists across the platforms, and none should be considered a substitute for professional guidance. Importantly, the phenomenon of AI hallucinations-where chatbots may generate factually incorrect or fabricated information-remains a critical risk that must be addressed to ensure safe integration into maternal health communication. Future efforts should focus on improving the transparency, accuracy, and multilingual reliability of AI chatbots to ensure their safe integration into maternal health communications.</p>\",\"PeriodicalId\":12977,\"journal\":{\"name\":\"Healthcare\",\"volume\":\"13 14\",\"pages\":\"\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2025-07-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Healthcare\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.3390/healthcare13141756\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Healthcare","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.3390/healthcare13141756","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract


Background and Objective: Large language model (LLM)-based chatbots are increasingly consulted by postpartum individuals seeking guidance on breastfeeding. However, the quality, readability, and guideline concordance of the chatbots' content remain uncertain. This study evaluated and compared the quality, readability, and factual accuracy of responses generated by three publicly accessible AI chatbots (ChatGPT-4o Pro, Gemini 2.5 Pro, and Copilot Pro) when prompted with common maternal questions related to breast-milk supply.

Methods: Twenty frequently asked breastfeeding-related questions were submitted to each chatbot in separate sessions. The responses were paraphrased to enable standardized scoring and were then evaluated using three validated tools: Ensuring Quality Information for Patients (EQIP), the Simple Measure of Gobbledygook (SMOG), and the Global Quality Scale (GQS). Factual accuracy was benchmarked against WHO, ACOG, CDC, and NICE guidelines using a three-point rubric. Additional user-experience metrics included response time, character count, content density, and structural formatting. Statistical comparisons were performed using the Kruskal-Wallis and Wilcoxon rank-sum tests with Bonferroni correction.

Results: ChatGPT-4o Pro achieved the highest overall performance on all primary outcomes: EQIP score (85.7 ± 2.4%), SMOG score (9.78 ± 0.22), and GQS rating (4.55 ± 0.50), followed by Gemini 2.5 Pro and Copilot Pro (p < 0.001 for all comparisons). ChatGPT-4o Pro also demonstrated the highest factual alignment with clinical guidelines (95%), while Copilot Pro showed more frequent omissions or simplifications. Differences in response time and formatting quality were statistically significant, although not always clinically meaningful.

Conclusions: ChatGPT-4o Pro outperforms the other chatbots in delivering structured, readable, and guideline-concordant breastfeeding information. However, substantial variability persists across the platforms, and none should be considered a substitute for professional guidance. Importantly, AI hallucinations (chatbots generating factually incorrect or fabricated information) remain a critical risk. Future efforts should focus on improving the transparency, accuracy, and multilingual reliability of AI chatbots to ensure their safe integration into maternal health communication.
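For reference, the SMOG readability metric named in the Methods can be approximated in a few lines of Python. This is a minimal sketch, not the authors' instrument: the vowel-group syllable counter is a naive assumption (published analyses typically use validated tools), and the sample passage is illustrative rather than taken from the study's chatbot responses.

import math
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels (assumption,
    # not a validated syllabifier).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text: str) -> float:
    # SMOG formula (McLaughlin, 1969): grade level from the count of
    # polysyllabic words (3+ syllables), normalized to 30 sentences.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 3.1291 + 1.0430 * math.sqrt(polysyllables * 30 / len(sentences))

print(round(smog_grade(
    "Exclusive breastfeeding is recommended for the first six months. "
    "Consult a lactation specialist if milk supply appears insufficient."), 2))

Strictly, SMOG is defined over samples of at least 30 sentences; the normalization above extrapolates for shorter texts, which is the usual convention when scoring brief chatbot answers.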
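Likewise, the statistical procedure described in the Methods (a Kruskal-Wallis omnibus test followed by pairwise Wilcoxon rank-sum tests with Bonferroni correction) can be sketched with SciPy. The scores below are hypothetical placeholders, not the study's data.

from itertools import combinations
from scipy.stats import kruskal, ranksums

scores = {  # hypothetical per-question GQS ratings (1-5)
    "ChatGPT-4o Pro": [5, 4, 5, 5, 4, 5, 4, 5],
    "Gemini 2.5 Pro": [4, 4, 3, 4, 4, 5, 3, 4],
    "Copilot Pro":    [3, 3, 4, 3, 2, 3, 4, 3],
}

# Omnibus test across the three platforms.
h, p = kruskal(*scores.values())
print(f"Kruskal-Wallis: H={h:.2f}, p={p:.4f}")

# Pairwise Wilcoxon rank-sum tests with Bonferroni correction:
# each raw p-value is multiplied by the number of comparisons.
pairs = list(combinations(scores, 2))
for a, b in pairs:
    stat, p_raw = ranksums(scores[a], scores[b])
    p_adj = min(1.0, p_raw * len(pairs))
    print(f"{a} vs {b}: p_adj={p_adj:.4f}")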

Source journal: Healthcare (Medicine, Health Policy)
CiteScore: 3.50
Self-citation rate: 7.10%
Publication volume: 0
Review time: 47 days

Journal description: Healthcare (ISSN 2227-9032) is an international, peer-reviewed, open access journal (free for readers), which publishes original theoretical and empirical work in the interdisciplinary area of all aspects of medicine and health care research. Healthcare publishes Original Research Articles, Reviews, Case Reports, Research Notes and Short Communications. We encourage researchers to publish their experimental and theoretical results in as much detail as possible. For theoretical papers, full details of proofs must be provided so that the results can be checked; for experimental papers, full experimental details must be provided so that the results can be reproduced. Additionally, electronic files or software regarding the full details of the calculations, experimental procedure, etc., can be deposited along with the publication as “Supplementary Material”.