Assessing the accuracy and quality of artificial intelligence (AI) chatbot-generated responses in making patient-specific drug-therapy and healthcare-related decisions.

IF 3.3 | Zone 3 (Medicine) | JCR Q2 Medical Informatics

BMC Medical Informatics and Decision Making Pub Date : 2024-12-24 DOI:10.1186/s12911-024-02824-5

Meron W Shiferaw, Taylor Zheng, Abigail Winter, Leigh Ann Mike, Lingtak-Neander Chan

Volume 24, issue 1, article 404. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11668057/pdf/

Abstract

Background: Interactive artificial intelligence tools such as ChatGPT have gained popularity, yet little is known about their reliability as a reference tool for healthcare-related information for healthcare providers and trainees. The objective of this study was to assess the consistency, quality, and accuracy of the responses generated by ChatGPT on healthcare-related inquiries.

Methods: A total of 18 open-ended questions including six questions in three defined clinical areas (2 each to address "what", "why", and "how", respectively) were submitted to ChatGPT v3.5 based on real-world usage experience. The experiment was conducted in duplicate using 2 computers. Five investigators independently ranked each response using a 4-point scale to rate the quality of the bot's responses. The Delphi method was used to compare each investigator's score with the goal of reaching at least 80% consistency. The accuracy of the responses was checked using established professional references and resources. When the responses were in question, the bot was asked to provide reference material used for the investigators to determine the accuracy and quality. The investigators determined the consistency, accuracy, and quality by establishing a consensus.
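The abstract does not say how the 80% consistency target was computed. As a minimal sketch, assuming it means the share of investigators who gave the most common score on the 4-point scale, the check for one response could look like the following. The function names and the example ratings are hypothetical and are not taken from the study.

```python
from collections import Counter


def modal_agreement(scores):
    """Return (most common score, share of raters who gave that score)."""
    if not scores:
        raise ValueError("scores must not be empty")
    score, count = Counter(scores).most_common(1)[0]
    return score, count / len(scores)


def reached_consensus(scores, threshold=0.80):
    """True when the modal score is shared by at least `threshold` of raters.

    Assumption: "80% consistency" is read as modal agreement; the study
    may have used a different definition.
    """
    _, share = modal_agreement(scores)
    return share >= threshold


# Hypothetical ratings from five investigators on a 4-point scale.
ratings = {
    "Q1": [3, 3, 3, 4, 3],  # 4 of 5 agree
    "Q2": [1, 2, 2, 4, 3],  # only 2 of 5 agree
}

# Responses that would go to another Delphi round under this reading.
needs_another_round = [q for q, s in ratings.items() if not reached_consensus(s)]
```

With five raters, this reading of the threshold requires at least four identical scores per response; anything below that would be discussed again.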

Results: The speech pattern and length of the responses were consistent within the same user but different between users. Occasionally, ChatGPT provided 2 completely different responses to the same question. Overall, ChatGPT provided more accurate responses (8 out of 12) to the "what" questions with less reliable performance to the "why" and "how" questions. We identified errors in calculation, unit of measurement, and misuse of protocols by ChatGPT. Some of these errors could result in clinical decisions leading to harm. We also identified citations and references shown by ChatGPT that did not exist in the literature.

Conclusions: ChatGPT is not ready to take on the coaching role for either healthcare learners or healthcare professionals. The lack of consistency in the responses to the same question is problematic for both learners and decision-makers. The intrinsic assumptions made by the chatbot could lead to erroneous clinical decisions. The unreliability in providing valid references is a serious flaw in using ChatGPT to drive clinical decision making.


Source journal: BMC Medical Informatics and Decision Making (category: Medicine, Medical Informatics)

CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 297
Review time: 1 month

About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.