Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study.

Journal of Medical Internet Research · Impact Factor 5.8 · JCR Q1 (Health Care Sciences & Services) · CAS Zone 2 (Medicine)
Jonathan Yi-Shin Yau, Soheil Saadat, Edmund Hsu, Linda Suk-Ling Murphy, Jennifer S Roh, Jeffrey Suchard, Antonio Tapia, Warren Wiechmann, Mark I Langdorf
{"title":"Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study.","authors":"Jonathan Yi-Shin Yau, Soheil Saadat, Edmund Hsu, Linda Suk-Ling Murphy, Jennifer S Roh, Jeffrey Suchard, Antonio Tapia, Warren Wiechmann, Mark I Langdorf","doi":"10.2196/60291","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Recent surveys indicate that 48% of consumers actively use generative artificial intelligence (AI) for health-related inquiries. Despite widespread adoption and the potential to improve health care access, scant research examines the performance of AI chatbot responses regarding emergency care advice.</p><p><strong>Objective: </strong>We assessed the quality of AI chatbot responses to common emergency care questions. We sought to determine qualitative differences in responses from 4 free-access AI chatbots, for 10 different serious and benign emergency conditions.</p><p><strong>Methods: </strong>We created 10 emergency care questions that we fed into the free-access versions of ChatGPT 3.5 (OpenAI), Google Bard, Bing AI Chat (Microsoft), and Claude AI (Anthropic) on November 26, 2023. Each response was graded by 5 board-certified emergency medicine (EM) faculty for 8 domains of percentage accuracy, presence of dangerous information, factual accuracy, clarity, completeness, understandability, source reliability, and source relevancy. We determined the correct, complete response to the 10 questions from reputable and scholarly emergency medical references. These were compiled by an EM resident physician. For the readability of the chatbot responses, we used the Flesch-Kincaid Grade Level of each response from readability statistics embedded in Microsoft Word. Differences between chatbots were determined by the chi-square test.</p><p><strong>Results: </strong>Each of the 4 chatbots' responses to the 10 clinical questions were scored across 8 domains by 5 EM faculty, for 400 assessments for each chatbot. Together, the 4 chatbots had the best performance in clarity and understandability (both 85%), intermediate performance in accuracy and completeness (both 50%), and poor performance (10%) for source relevance and reliability (mostly unreported). Chatbots contained dangerous information in 5% to 35% of responses, with no statistical difference between chatbots on this metric (P=.24). ChatGPT, Google Bard, and Claud AI had similar performances across 6 out of 8 domains. Only Bing AI performed better with more identified or relevant sources (40%; the others had 0%-10%). Flesch-Kincaid Reading level was 7.7-8.9 grade for all chatbots, except ChatGPT at 10.8, which were all too advanced for average emergency patients. Responses included both dangerous (eg, starting cardiopulmonary resuscitation with no pulse check) and generally inappropriate advice (eg, loosening the collar to improve breathing without evidence of airway compromise).</p><p><strong>Conclusions: </strong>AI chatbots, though ubiquitous, have significant deficiencies in EM patient advice, despite relatively consistent performance. Information for when to seek urgent or emergent care is frequently incomplete and inaccurate, and patients may be unaware of misinformation. Sources are not generally provided. Patients who use AI to guide health care decisions assume potential risks. AI chatbots for health should be subject to further research, refinement, and regulation. 
We strongly recommend proper medical consultation to prevent potential adverse outcomes.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"26 ","pages":"e60291"},"PeriodicalIF":5.8000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11574488/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Internet Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/60291","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Recent surveys indicate that 48% of consumers actively use generative artificial intelligence (AI) for health-related inquiries. Despite widespread adoption and the potential to improve health care access, scant research examines the performance of AI chatbot responses regarding emergency care advice.

Objective: We assessed the quality of AI chatbot responses to common emergency care questions. We sought to determine qualitative differences in responses from 4 free-access AI chatbots, for 10 different serious and benign emergency conditions.

Methods: We created 10 emergency care questions and fed them into the free-access versions of ChatGPT 3.5 (OpenAI), Google Bard, Bing AI Chat (Microsoft), and Claude AI (Anthropic) on November 26, 2023. Each response was graded by 5 board-certified emergency medicine (EM) faculty across 8 domains: percentage accuracy, presence of dangerous information, factual accuracy, clarity, completeness, understandability, source reliability, and source relevancy. We determined the correct, complete response to each of the 10 questions from reputable, scholarly emergency medicine references, compiled by an EM resident physician. To assess the readability of the chatbot responses, we used the Flesch-Kincaid Grade Level of each response, as reported by the readability statistics embedded in Microsoft Word. Differences between chatbots were assessed with the chi-square test.
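For readers unfamiliar with the metric, the Flesch-Kincaid Grade Level is a fixed formula over words, sentences, and syllables: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The study took its values from Microsoft Word's built-in readability statistics; the Python sketch below is only a minimal approximation with a heuristic syllable counter, so its output will differ slightly from Word's, and the sample reply is invented for illustration.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, trim a trailing silent 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / max(len(sentences), 1))
            + 11.8 * (syllables / max(len(words), 1)) - 15.59)

# Example: a hypothetical chatbot reply to an emergency care question.
reply = ("If chest pain lasts more than a few minutes or spreads to your arm or jaw, "
         "call emergency services immediately. Do not drive yourself to the hospital.")
print(f"Flesch-Kincaid Grade Level: {flesch_kincaid_grade(reply):.1f}")
```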

Results: Each of the 4 chatbots' responses to the 10 clinical questions was scored across 8 domains by 5 EM faculty, yielding 400 assessments per chatbot. Overall, the 4 chatbots performed best in clarity and understandability (both 85%), intermediate in accuracy and completeness (both 50%), and poorly (10%) in source relevance and reliability (sources were mostly unreported). Chatbots included dangerous information in 5% to 35% of responses, with no statistically significant difference between chatbots on this metric (P=.24). ChatGPT, Google Bard, and Claude AI performed similarly across 6 of the 8 domains. Only Bing AI performed better, identifying relevant sources in 40% of responses (the others: 0%-10%). The Flesch-Kincaid reading level was grade 7.7-8.9 for all chatbots except ChatGPT (grade 10.8); all were too advanced for the average emergency patient. Responses included both dangerous advice (eg, starting cardiopulmonary resuscitation with no pulse check) and generally inappropriate advice (eg, loosening the collar to improve breathing without evidence of airway compromise).
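The reported P=.24 comes from a chi-square test comparing how often each chatbot's responses contained dangerous information. The sketch below shows the shape of such a test on a 2 × 4 contingency table; the counts are assumed placeholders consistent with the reported 5%-35% range, not the study's data.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of responses flagged as containing dangerous information,
# out of 50 assessments per chatbot (10 questions x 5 raters).
# Illustrative placeholders only; these are not the study's data.
dangerous = [4, 10, 17, 6]
safe = [50 - d for d in dangerous]

# 2 x 4 contingency table: rows = dangerous / not dangerous, columns = chatbots
table = [dangerous, safe]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p:.3f}")
```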

Conclusions: AI chatbots, though ubiquitous, have significant deficiencies in EM patient advice, despite relatively consistent performance. Information for when to seek urgent or emergent care is frequently incomplete and inaccurate, and patients may be unaware of misinformation. Sources are not generally provided. Patients who use AI to guide health care decisions assume potential risks. AI chatbots for health should be subject to further research, refinement, and regulation. We strongly recommend proper medical consultation to prevent potential adverse outcomes.
