Comparing physician and artificial intelligence chatbot responses to posthysterectomy questions posted to a public social media forum

Shadae K. Beale MD, Natalie Cohen MD, Beatrice Secheli MD, Donald McIntire PhD, Kimberly A. Kho MD, MPH
{"title":"Comparing physician and artificial intelligence chatbot responses to posthysterectomy questions posted to a public social media forum","authors":"Shadae K. Beale MD ,&nbsp;Natalie Cohen MD ,&nbsp;Beatrice Secheli MD ,&nbsp;Donald McIntire PhD ,&nbsp;Kimberly A. Kho MD, MPH","doi":"10.1016/j.xagr.2025.100553","DOIUrl":null,"url":null,"abstract":"<div><h3>BACKGROUND</h3><div>Within public online forums, patients often seek reassurance and guidance from the community regarding postoperative symptoms and expectations, and when to seek medical assistance. Others are using artificial intelligence in the form of online search engines or chatbots such as ChatGPT or Perplexity. Artificial intelligence chatbot assistants have been growing in popularity; however, clinicians may be hesitant to use them because of concerns about accuracy. The online networking service for medical professionals, Doximity, has expanded its resources to include a Health Insurance Portability and Accountability Act–compliant artificial intelligence writing assistant, Doximity GPT, designed to reduce the administrative burden on clinicians. Health professionals learn using a “medical model,” which greatly differs from the “health belief model” that laypeople learn through. This mismatch in learning perspectives likely contributes to a communication mismatch even during digital clinician–patient encounters, especially in patients with limited health literacy during the perioperative period when complications may arise.</div></div><div><h3>OBJECTIVE</h3><div>This study aimed to evaluate the ability of artificial intelligence chatbot assistants (Doximity GPT, Perplexity, and ChatGPT) to generate quality, accurate, and empathetic responses to postoperative patient queries that are also understandable and actionable.</div></div><div><h3>STUDY DESIGN</h3><div>Responses to 10 postoperative queries sourced from HysterSisters, a public forum for “woman-to-woman hysterectomy support,” were generated using 3 artificial intelligence chatbot assistants (Doximity GPT, Perplexity, and ChatGPT) and a minimally invasive gynecologic surgery fellowship–trained surgeon. Ten physician evaluators compared the blinded responses for quality, accuracy, and empathy. A separate pair of physician evaluators scored the responses for understandability and actionability using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). The final scores were the average of both reviewers’ scores. Analysis of variance was used for pairwise comparison of the evaluator scores between sources. Lastly, the Kruskal–Wallis test was used to analyze Flesch–Kincaid scoring for readability. The Pearson chi-square test was used to demonstrate the difference in reading level among the responses for each source.</div></div><div><h3>RESULTS</h3><div>Compared with a physician, Doximity GPT and ChatGPT were rated as more empathetic than a minimally invasive gynecologic surgeon, but quality and accuracy were similar across these sources. There was a significant difference between Perplexity and the other response sources, favoring the latter, for quality and accuracy (<em>P</em>&lt;.001). Perplexity and the minimally invasive gynecologic surgeon ranked similarly for empathy. Reading ease was greater for the minimally invasive gynecologic surgeon responses (60.6 [53.5–68.4]; eighth and ninth grade) than for Perplexity (40.0 [28.6–47.2], college) and ChatGPT (35.5 [28.2–42.0], college) (<em>P</em>&lt;.01). 
There was no significant difference in understandability and actionability, with all sources scored as having good understandability and average actionability.</div></div><div><h3>CONCLUSION</h3><div>As artificial intelligence chatbot assistants grow in popularity, including integration in the electronic health record, the output’s readability must reflect the general population’s health literacy to be impactful and effective. This analysis serves as a reminder for physicians to be mindful of this mismatch in readability and general health literacy when considering the integration of artificial intelligence chatbot assistants into patient care. The accuracy and consistency of these chatbots may also impact patient outcomes, making screening of utmost importance in this endeavor.</div></div>","PeriodicalId":72141,"journal":{"name":"AJOG global reports","volume":"5 3","pages":"Article 100553"},"PeriodicalIF":0.0000,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"AJOG global reports","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666577825001145","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

BACKGROUND

Within public online forums, patients often seek reassurance and guidance from the community regarding postoperative symptoms and expectations and when to seek medical assistance. Others turn to artificial intelligence, whether through online search engines or chatbots such as ChatGPT and Perplexity. Artificial intelligence chatbot assistants have been growing in popularity; however, clinicians may be hesitant to use them because of concerns about accuracy. Doximity, the online networking service for medical professionals, has expanded its resources to include a Health Insurance Portability and Accountability Act–compliant artificial intelligence writing assistant, Doximity GPT, designed to reduce the administrative burden on clinicians. Health professionals learn through a "medical model," which differs greatly from the "health belief model" through which laypeople learn. This mismatch in learning perspectives likely contributes to communication gaps even during digital clinician–patient encounters, especially for patients with limited health literacy during the perioperative period, when complications may arise.

OBJECTIVE

This study aimed to evaluate the ability of artificial intelligence chatbot assistants (Doximity GPT, Perplexity, and ChatGPT) to generate quality, accurate, and empathetic responses to postoperative patient queries that are also understandable and actionable.

STUDY DESIGN

Responses to 10 postoperative queries sourced from HysterSisters, a public forum for "woman-to-woman hysterectomy support," were generated using 3 artificial intelligence chatbot assistants (Doximity GPT, Perplexity, and ChatGPT) and a minimally invasive gynecologic surgery fellowship–trained surgeon. Ten physician evaluators compared the blinded responses for quality, accuracy, and empathy. A separate pair of physician evaluators scored the responses for understandability and actionability using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P); the final scores were the average of the two reviewers' scores. Analysis of variance was used for pairwise comparison of evaluator scores between sources, the Kruskal–Wallis test was used to compare Flesch–Kincaid readability scores, and the Pearson chi-square test was used to assess differences in reading level among the responses from each source.
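In computational terms, the comparisons described above map onto standard statistical routines. The Python sketch below illustrates the three tests with SciPy on hypothetical placeholder scores; the group sizes, score values, and contingency counts are assumptions for illustration, not the published dataset, and the paper's pairwise comparisons are approximated here with an omnibus ANOVA plus a Tukey follow-up.

```python
# Minimal sketch of the statistical comparisons described above.
# All scores below are hypothetical placeholders, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical evaluator scores (10 evaluators x 10 queries) per source.
quality = {
    "Surgeon":      rng.normal(4.2, 0.5, 100),
    "Doximity GPT": rng.normal(4.3, 0.5, 100),
    "ChatGPT":      rng.normal(4.2, 0.5, 100),
    "Perplexity":   rng.normal(3.4, 0.5, 100),
}

# One-way analysis of variance across the 4 response sources.
f_stat, p_anova = stats.f_oneway(*quality.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4g}")

# Pairwise follow-up between sources (Tukey's honestly significant
# difference; requires SciPy >= 1.8).
print(stats.tukey_hsd(*quality.values()))

# Kruskal-Wallis test on Flesch-Kincaid reading-ease scores.
reading_ease = {
    "Surgeon":    rng.normal(60.6, 6, 10),
    "Perplexity": rng.normal(40.0, 6, 10),
    "ChatGPT":    rng.normal(35.5, 6, 10),
}
h_stat, p_kw = stats.kruskal(*reading_ease.values())
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kw:.4g}")

# Pearson chi-square on a source-by-reading-level contingency table
# (rows: sources; columns: counts of responses per grade band).
table = np.array([[7, 3, 0],   # Surgeon
                  [1, 2, 7],   # Perplexity
                  [0, 2, 8]])  # ChatGPT
chi2, p_chi, dof, _ = stats.chi2_contingency(table)
print(f"Chi-square: chi2={chi2:.2f}, dof={dof}, p={p_chi:.4g}")
```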

RESULTS

Doximity GPT and ChatGPT were rated as more empathetic than the minimally invasive gynecologic surgeon, whereas quality and accuracy were similar across these three sources. Perplexity scored significantly lower than the other response sources for quality and accuracy (P<.001) and ranked similarly to the minimally invasive gynecologic surgeon for empathy. Reading ease was greater for the surgeon's responses (60.6 [53.5–68.4]; eighth to ninth grade) than for Perplexity (40.0 [28.6–47.2]; college) and ChatGPT (35.5 [28.2–42.0]; college) (P<.01). There was no significant difference in understandability or actionability; all sources scored as having good understandability and average actionability.
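For context, the reading-ease values above are Flesch Reading Ease scores, computed as 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word); scores of roughly 60 to 70 correspond to an eighth-to-ninth-grade level and scores in the 30s to 40s to college level. The sketch below shows how such a score is computed, assuming a naive vowel-group syllable counter; dedicated readability tools count syllables more carefully, so treat this as an approximation.

```python
# Rough illustration of the Flesch Reading Ease formula behind the
# scores reported above. The syllable counter is a crude heuristic.
import re

def count_syllables(word: str) -> int:
    # Count contiguous vowel groups; adequate for a demonstration.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

# Hypothetical response text; higher scores indicate easier reading.
sample = ("Light spotting after a hysterectomy is common. "
          "Call your surgeon if bleeding becomes heavy.")
print(f"{flesch_reading_ease(sample):.1f}")
```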

CONCLUSION

As artificial intelligence chatbot assistants grow in popularity, including through integration into the electronic health record, their output's readability must reflect the general population's health literacy to be impactful and effective. This analysis serves as a reminder for physicians to be mindful of this mismatch between output readability and general health literacy when considering the integration of artificial intelligence chatbot assistants into patient care. Because the accuracy and consistency of these chatbots may also affect patient outcomes, screening their output is of utmost importance in this endeavor.