Artificial intelligence in hepatology: a comparative analysis of ChatGPT-4, Bing, and Bard at answering clinical questions

Sama Anvari, Yung Lee, David Shiqiang Jin, Sarah Malone, Matthew Collins

Journal of the Canadian Association of Gastroenterology, 8(2), 58-62 (2025). DOI: 10.1093/jcag/gwae055
Abstract
Background and aims: The role of artificial intelligence (AI) in hepatology is rapidly expanding. However, the ability of AI chat models such as ChatGPT to accurately answer clinical questions remains unclear. This study aims to determine the ability of large language models (LLMs) to answer questions in hepatology and to compare the accuracy and quality of responses provided by different LLMs.
Methods: Hepatology questions from the Digestive Diseases Self-Education Platform were entered into three LLMs (OpenAI's ChatGPT-4, Microsoft's Bing, and Google's Bard) between September 7 and 13, 2023. Questions were posed with and without multiple-choice answers. Generated responses were assessed based on accuracy and number of correct answers. Statistical analysis was performed to determine the number of correct responses per LLM per category.
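The study entered questions manually into each chat interface; purely as an illustration of the scoring step described above, the following minimal Python sketch tallies per-model, per-category accuracy from recorded answers against an answer key. The data structures (`responses`, `answer_key`) and the function name are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: tally per-model, per-category accuracy from recorded answers.
# This is not the paper's actual procedure, which used manual entry into chat interfaces.
from collections import defaultdict

def accuracy_by_category(responses, answer_key):
    """responses: {model: {question_id: answer}}; answer_key: {question_id: (correct, category)}."""
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # model -> category -> [correct, total]
    for model, answers in responses.items():
        for qid, (correct, category) in answer_key.items():
            given = answers.get(qid)  # None if the model declined to answer
            tallies[model][category][1] += 1
            if given is not None and given == correct:
                tallies[model][category][0] += 1
    return {m: {c: right / total for c, (right, total) in cats.items()}
            for m, cats in tallies.items()}
```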
Results: A total of 144 questions were used to query the AI models. For multiple-choice questions, ChatGPT-4's accuracy was 62.3%, Bing's was 53.5%, and Bard's was 38.2% (P < .001). For open-ended questions, ChatGPT-4's accuracy was 44.4%, Bing's was 28.5%, and Bard's was 21.4% (P < .001). ChatGPT-4 and Bing attempted to answer 100% of the questions, whereas Bard was unable to answer 11.8% of them. All three LLMs provided a rationale in addition to an answer, as well as counselling where appropriate.
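The reported three-way comparison of accuracies is the kind of result a chi-square test of independence produces. The sketch below re-derives such a test using approximate correct/incorrect counts obtained by rounding the reported MCQ percentages against 144 questions; these counts are illustrative assumptions, not the paper's raw data.

```python
# Illustrative chi-square test on approximate counts rounded from the reported
# MCQ accuracies (62.3%, 53.5%, 38.2% of 144); not the paper's actual dataset.
from scipy.stats import chi2_contingency

correct = [90, 77, 55]                      # approx. correct answers: ChatGPT-4, Bing, Bard
incorrect = [144 - c for c in correct]      # remaining questions counted as incorrect

chi2, p, dof, expected = chi2_contingency([correct, incorrect])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")  # yields p < .001, consistent with the abstract
```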
Conclusions: LLMs demonstrate variable accuracy when answering clinical questions related to hepatology, though they show comparable efficacy when presented with questions in an open-ended versus multiple-choice (MCQ) format. Further research is required to investigate the optimal use of LLMs in clinical and educational contexts.