{"title":"Benchmarking AI Chatbots for Maternal Lactation Support: A Cross-Platform Evaluation of Quality, Readability, and Clinical Accuracy.","authors":"İlke Özer Aslan, Mustafa Törehan Aslan","doi":"10.3390/healthcare13141756","DOIUrl":null,"url":null,"abstract":"<p><p><b>Background and Objective:</b> Large language model (LLM)-based chatbots are increasingly utilized by postpartum individuals seeking guidance on breastfeeding. However, the chatbots' content quality, readability, and alignment with clinical guidelines remain uncertain. This study was conducted to evaluate and compare the quality, readability, and factual accuracy of responses generated by three publicly accessible AI chatbots-ChatGPT-4o Pro, Gemini 2.5 Pro, and Copilot Pro-when prompted with common maternal questions related to breast-milk supply. <b>Methods:</b> Twenty frequently asked breastfeeding-related questions were submitted to each chatbot in separate sessions. The responses were paraphrased to enable standardized scoring and were then evaluated using three validated tools: ensuring quality information for patients (EQIP), the simple measure of gobbledygook (SMOG), and the global quality scale (GQS). Factual accuracy was benchmarked against WHO, ACOG, CDC, and NICE guidelines using a three-point rubric. Additional user experience metrics included response time, character count, content density, and structural formatting. Statistical comparisons were performed using the Kruskal-Wallis and Wilcoxon rank-sum tests with Bonferroni correction. <b>Results:</b> ChatGPT-4o Pro achieved the highest overall performance across all primary outcomes: EQIP score (85.7 ± 2.4%), SMOG score (9.78 ± 0.22), and GQS rating (4.55 ± 0.50), followed by Gemini 2.5 Pro and Copilot Pro (<i>p</i> < 0.001 for all comparisons). ChatGPT-4o Pro also demonstrated the highest factual alignment with clinical guidelines (95%), while Copilot showed more frequent omissions or simplifications. Differences in response time and formatting quality were statistically significant, although not always clinically meaningful. <b>Conclusions:</b> ChatGPT-4o Pro outperforms other chatbots in delivering structured, readable, and guideline-concordant breastfeeding information. However, substantial variability persists across the platforms, and none should be considered a substitute for professional guidance. Importantly, the phenomenon of AI hallucinations-where chatbots may generate factually incorrect or fabricated information-remains a critical risk that must be addressed to ensure safe integration into maternal health communication. Future efforts should focus on improving the transparency, accuracy, and multilingual reliability of AI chatbots to ensure their safe integration into maternal health communications.</p>","PeriodicalId":12977,"journal":{"name":"Healthcare","volume":"13 14","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2025-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Healthcare","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.3390/healthcare13141756","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0
Abstract
Background and Objective: Large language model (LLM)-based chatbots are increasingly used by postpartum individuals seeking guidance on breastfeeding, yet the quality, readability, and guideline alignment of their content remain uncertain. This study evaluated and compared the quality, readability, and factual accuracy of responses generated by three publicly accessible AI chatbots (ChatGPT-4o Pro, Gemini 2.5 Pro, and Copilot Pro) when prompted with common maternal questions about breast-milk supply.

Methods: Twenty frequently asked breastfeeding-related questions were submitted to each chatbot in separate sessions. The responses were paraphrased to enable standardized scoring and were then evaluated with three validated tools: the Ensuring Quality Information for Patients (EQIP) instrument, the Simple Measure of Gobbledygook (SMOG), and the Global Quality Scale (GQS). Factual accuracy was benchmarked against WHO, ACOG, CDC, and NICE guidelines using a three-point rubric. Additional user-experience metrics included response time, character count, content density, and structural formatting. Statistical comparisons were performed using the Kruskal-Wallis and Wilcoxon rank-sum tests with Bonferroni correction; a minimal sketch of this analysis pipeline appears after the abstract.

Results: ChatGPT-4o Pro achieved the highest overall performance on all primary outcomes: EQIP score (85.7 ± 2.4%), SMOG score (9.78 ± 0.22), and GQS rating (4.55 ± 0.50), followed by Gemini 2.5 Pro and Copilot Pro (p < 0.001 for all comparisons). ChatGPT-4o Pro also demonstrated the highest factual alignment with clinical guidelines (95%), whereas Copilot omitted or oversimplified information more often. Differences in response time and formatting quality were statistically significant, although not always clinically meaningful.

Conclusions: ChatGPT-4o Pro outperforms the other chatbots in delivering structured, readable, and guideline-concordant breastfeeding information. However, substantial variability persists across the platforms, and none should be considered a substitute for professional guidance. Importantly, AI hallucinations, in which chatbots generate factually incorrect or fabricated information, remain a critical risk. Future efforts should focus on improving the transparency, accuracy, and multilingual reliability of AI chatbots to ensure their safe integration into maternal health communication.
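Analysis sketch (illustrative, not the authors' code): the snippet below shows, under stated assumptions, how the two quantitative steps named in the Methods could be implemented in Python: computing a SMOG grade for a response text using the standard SMOG formula, and comparing per-question scores across the three chatbots with a Kruskal-Wallis omnibus test followed by pairwise Wilcoxon rank-sum tests with Bonferroni correction (via scipy.stats). The syllable counter is a rough heuristic, and the per-question score lists are hypothetical placeholders, not the study's data.

```python
import re
from math import sqrt
from itertools import combinations
from scipy.stats import kruskal, ranksums

def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of consecutive vowels (min. 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text: str) -> float:
    """Standard SMOG formula: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * sqrt(polysyllables * 30 / len(sentences)) + 3.1291

# Hypothetical per-question EQIP scores (%) for the 20 prompts, one list per
# chatbot; the real values would come from the study's scoring sheets.
scores = {
    "ChatGPT-4o Pro": [86, 84, 88, 85, 87, 83, 86, 85, 88, 84,
                       86, 87, 85, 84, 88, 86, 85, 87, 84, 86],
    "Gemini 2.5 Pro": [78, 75, 80, 77, 76, 79, 74, 78, 77, 76,
                       79, 75, 78, 77, 80, 74, 76, 78, 75, 77],
    "Copilot Pro":    [70, 68, 72, 69, 71, 67, 70, 69, 72, 68,
                       70, 71, 69, 68, 72, 70, 69, 71, 68, 70],
}

# Omnibus comparison across the three chatbots.
h_stat, p_omnibus = kruskal(*scores.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_omnibus:.4f}")

# Pairwise Wilcoxon rank-sum tests with Bonferroni correction
# (p multiplied by the number of pairwise comparisons, capped at 1).
pairs = list(combinations(scores, 2))
for a, b in pairs:
    stat, p = ranksums(scores[a], scores[b])
    p_adj = min(1.0, p * len(pairs))
    print(f"{a} vs {b}: adjusted p = {p_adj:.4f}")
```

Note that the classic SMOG formula assumes a 30-sentence sample; the 30/sentences scaling above is the usual generalization to texts of other lengths, which matters for short chatbot responses.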
Journal description:
Healthcare (ISSN 2227-9032) is an international, peer-reviewed, open-access journal (free for readers) that publishes original theoretical and empirical work across all aspects of medicine and health care research. Healthcare publishes Original Research Articles, Reviews, Case Reports, Research Notes, and Short Communications. We encourage researchers to publish their experimental and theoretical results in as much detail as possible. For theoretical papers, full details of proofs must be provided so that the results can be checked; for experimental papers, full experimental details must be provided so that the results can be reproduced. Additionally, electronic files or software containing the full details of the calculations, experimental procedures, etc., can be deposited along with the publication as “Supplementary Material”.