Assessing the accuracy and quality of artificial intelligence (AI) chatbot-generated responses in making patient-specific drug-therapy and healthcare-related decisions.

IF 3.3 | CAS Tier 3 (Medicine) | Q2 MEDICAL INFORMATICS
Meron W Shiferaw, Taylor Zheng, Abigail Winter, Leigh Ann Mike, Lingtak-Neander Chan
{"title":"Assessing the accuracy and quality of artificial intelligence (AI) chatbot-generated responses in making patient-specific drug-therapy and healthcare-related decisions.","authors":"Meron W Shiferaw, Taylor Zheng, Abigail Winter, Leigh Ann Mike, Lingtak-Neander Chan","doi":"10.1186/s12911-024-02824-5","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Interactive artificial intelligence tools such as ChatGPT have gained popularity, yet little is known about their reliability as a reference tool for healthcare-related information for healthcare providers and trainees. The objective of this study was to assess the consistency, quality, and accuracy of the responses generated by ChatGPT on healthcare-related inquiries.</p><p><strong>Methods: </strong>A total of 18 open-ended questions including six questions in three defined clinical areas (2 each to address \"what\", \"why\", and \"how\", respectively) were submitted to ChatGPT v3.5 based on real-world usage experience. The experiment was conducted in duplicate using 2 computers. Five investigators independently ranked each response using a 4-point scale to rate the quality of the bot's responses. The Delphi method was used to compare each investigator's score with the goal of reaching at least 80% consistency. The accuracy of the responses was checked using established professional references and resources. When the responses were in question, the bot was asked to provide reference material used for the investigators to determine the accuracy and quality. The investigators determined the consistency, accuracy, and quality by establishing a consensus.</p><p><strong>Results: </strong>The speech pattern and length of the responses were consistent within the same user but different between users. Occasionally, ChatGPT provided 2 completely different responses to the same question. Overall, ChatGPT provided more accurate responses (8 out of 12) to the \"what\" questions with less reliable performance to the \"why\" and \"how\" questions. We identified errors in calculation, unit of measurement, and misuse of protocols by ChatGPT. Some of these errors could result in clinical decisions leading to harm. We also identified citations and references shown by ChatGPT that did not exist in the literature.</p><p><strong>Conclusions: </strong>ChatGPT is not ready to take on the coaching role for either healthcare learners or healthcare professionals. The lack of consistency in the responses to the same question is problematic for both learners and decision-makers. The intrinsic assumptions made by the chatbot could lead to erroneous clinical decisions. The unreliability in providing valid references is a serious flaw in using ChatGPT to drive clinical decision making.</p>","PeriodicalId":9340,"journal":{"name":"BMC Medical Informatics and Decision Making","volume":"24 1","pages":"404"},"PeriodicalIF":3.3000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11668057/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Informatics and Decision Making","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12911-024-02824-5","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Citations: 0

Abstract

Background: Interactive artificial intelligence tools such as ChatGPT have gained popularity, yet little is known about their reliability as a reference tool for healthcare-related information for healthcare providers and trainees. The objective of this study was to assess the consistency, quality, and accuracy of the responses generated by ChatGPT on healthcare-related inquiries.

Methods: A total of 18 open-ended questions, six in each of three defined clinical areas (two each addressing "what", "why", and "how"), were submitted to ChatGPT v3.5 based on real-world usage experience. The experiment was conducted in duplicate using two computers. Five investigators independently rated each response on a 4-point scale to assess the quality of the bot's responses. The Delphi method was used to compare the investigators' scores, with the goal of reaching at least 80% consistency. The accuracy of the responses was checked against established professional references and resources. When a response was in question, the bot was asked to provide the reference material it used so that the investigators could determine its accuracy and quality. The investigators determined consistency, accuracy, and quality by consensus.
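
The duplicate-submission and rater-consistency workflow described above can be illustrated in code. The following Python snippet is a minimal sketch only, not the authors' actual protocol: it assumes the openai Python client, the "gpt-3.5-turbo" model name as a stand-in for ChatGPT v3.5, a hypothetical drug-therapy question, and a simple modal-agreement check in place of the full Delphi consensus process.

```python
# Minimal sketch of the duplicate-query and rater-consistency workflow.
# Assumptions (not from the paper): the `openai` Python client, the
# "gpt-3.5-turbo" model as a stand-in for ChatGPT v3.5, and a simple
# modal-agreement check instead of the full Delphi process.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_twice(question: str) -> list[str]:
    """Submit the same question twice, mirroring the duplicate runs on two computers."""
    answers = []
    for _ in range(2):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": question}],
        )
        answers.append(resp.choices[0].message.content)
    return answers


def rater_agreement(scores: list[int]) -> float:
    """Fraction of raters agreeing with the most common 4-point quality score."""
    most_common_count = Counter(scores).most_common(1)[0][1]
    return most_common_count / len(scores)


if __name__ == "__main__":
    # Hypothetical "how" question in a drug-therapy area.
    question = "How should vancomycin be dosed for a 70 kg adult with normal renal function?"
    first, second = ask_twice(question)
    print("Responses identical:", first == second)

    # Hypothetical 4-point quality scores from five independent raters.
    scores = [3, 3, 4, 3, 3]
    agreement = rater_agreement(scores)
    print("Agreement:", agreement, ">= 0.80?", agreement >= 0.80)
```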

Results: The speech pattern and length of the responses were consistent within the same user but differed between users. Occasionally, ChatGPT provided two completely different responses to the same question. Overall, ChatGPT provided more accurate responses to the "what" questions (8 out of 12), with less reliable performance on the "why" and "how" questions. We identified errors by ChatGPT in calculations and units of measurement, as well as misuse of protocols. Some of these errors could result in clinical decisions that lead to harm. We also identified citations and references presented by ChatGPT that do not exist in the literature.

Conclusions: ChatGPT is not ready to take on the coaching role for either healthcare learners or healthcare professionals. The lack of consistency in the responses to the same question is problematic for both learners and decision-makers. The intrinsic assumptions made by the chatbot could lead to erroneous clinical decisions. The unreliability in providing valid references is a serious flaw in using ChatGPT to drive clinical decision making.

Source journal
CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 297
Review time: 1 month
Journal description: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.