Artificial intelligence in hepatology: a comparative analysis of ChatGPT-4, Bing, and Bard at answering clinical questions.

Journal of the Canadian Association of Gastroenterology. Pub Date: 2025-02-22. eCollection Date: 2025-04-01. DOI: 10.1093/jcag/gwae055
Sama Anvari, Yung Lee, David Shiqiang Jin, Sarah Malone, Matthew Collins
{"title":"Artificial intelligence in hepatology: a comparative analysis of ChatGPT-4, Bing, and Bard at answering clinical questions.","authors":"Sama Anvari, Yung Lee, David Shiqiang Jin, Sarah Malone, Matthew Collins","doi":"10.1093/jcag/gwae055","DOIUrl":null,"url":null,"abstract":"<p><strong>Background and aims: </strong>The role of artificial intelligence (AI) in hepatology is rapidly expanding. However, the ability of AI chat models such as ChatGPT to accurately answer clinical questions remains unclear. This study aims to determine the ability of large language models (LLMs) to answer questions in hepatology, as well as compare the accuracy and quality of responses provided by different LLMs.</p><p><strong>Methods: </strong>Hepatology questions from the Digestive Diseases Self-Education Platform were entered into three LLMs (OpenAI's ChatGPT-4, Microsoft's Bing, and Google's Bard) between September 7 and 13, 2023. Questions were posed with and without multiple-choice answers. Generated responses were assessed based on accuracy and number of correct answers. Statistical analysis was performed to determine the number of correct responses per LLM per category.</p><p><strong>Results: </strong>A total of 144 questions were used to query the AI models. ChatGPT-4's accuracy was 62.3%, Bing's accuracy was 53.5%, and Bard's accuracy was 38.2% (<i>P</i> < .001) for multiple-choice questions. For open-ended questions, ChatGPT-4's accuracy was 44.4%, Bing's was 28.5%, and Bard's was 21.4% (<i>P</i> < .001). ChatGPT-4 and Bing attempted to answer 100% of the questions, whereas Bard was unable to answer 11.8% of the questions. All 3 LLMs provided a rationale in addition to an answer, as well as counselling where appropriate.</p><p><strong>Conclusions: </strong>LLMs demonstrate variable accuracy when answering clinical questions related to hepatology, though show comparable efficacy when presented with questions in an open-ended versus multiple choice (MCQ) format. Further research is required to investigate the optimal use of LLMs in clinical and educational contexts.</p>","PeriodicalId":17263,"journal":{"name":"Journal of the Canadian Association of Gastroenterology","volume":"8 2","pages":"58-62"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11991870/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Canadian Association of Gastroenterology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/jcag/gwae055","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background and aims: The role of artificial intelligence (AI) in hepatology is rapidly expanding. However, the ability of AI chat models such as ChatGPT to accurately answer clinical questions remains unclear. This study aims to determine the ability of large language models (LLMs) to answer questions in hepatology, as well as compare the accuracy and quality of responses provided by different LLMs.

Methods: Hepatology questions from the Digestive Diseases Self-Education Platform were entered into three LLMs (OpenAI's ChatGPT-4, Microsoft's Bing, and Google's Bard) between September 7 and 13, 2023. Questions were posed with and without multiple-choice answers. Generated responses were assessed based on accuracy and number of correct answers. Statistical analysis was performed to determine the number of correct responses per LLM per category.
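The abstract does not detail how graded responses were tallied beyond "correct responses per LLM per category." As a purely illustrative sketch (not the authors' actual workflow, which presumably involved manual entry into each chat interface and manual grading against the DDSEP answer key), the snippet below shows one way such per-model, per-category tallies could be computed once graded responses exist; all field names and records are hypothetical.

```python
from collections import defaultdict

# Hypothetical graded records: one entry per (model, question) pair.
# "category" is the DDSEP topic area; "correct" marks agreement with
# the answer key. All values below are illustrative only.
graded_responses = [
    {"model": "ChatGPT-4", "category": "Cirrhosis", "correct": True},
    {"model": "Bing",      "category": "Cirrhosis", "correct": False},
    {"model": "Bard",      "category": "Cirrhosis", "correct": False},
    # ... one record per model per question ...
]

def tally_correct(records):
    """Count correct answers per model per category."""
    counts = defaultdict(lambda: defaultdict(int))
    for record in records:
        if record["correct"]:
            counts[record["model"]][record["category"]] += 1
    return counts

for model, per_category in tally_correct(graded_responses).items():
    for category, n_correct in per_category.items():
        print(f"{model:10s} {category:15s} correct: {n_correct}")
```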

Results: A total of 144 questions were used to query the AI models. ChatGPT-4's accuracy was 62.3%, Bing's accuracy was 53.5%, and Bard's accuracy was 38.2% (P < .001) for multiple-choice questions. For open-ended questions, ChatGPT-4's accuracy was 44.4%, Bing's was 28.5%, and Bard's was 21.4% (P < .001). ChatGPT-4 and Bing attempted to answer 100% of the questions, whereas Bard was unable to answer 11.8% of the questions. All 3 LLMs provided a rationale in addition to an answer, as well as counselling where appropriate.
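The abstract reports P < .001 for the between-model comparison but does not name the statistical test; a chi-squared test of independence on a models-by-outcome contingency table is one common choice for this kind of accuracy comparison. The sketch below uses made-up counts scaled from the reported multiple-choice accuracies over 144 questions and is an assumption for illustration, not the authors' confirmed analysis.

```python
from scipy.stats import chi2_contingency

# Illustrative counts only: correct vs. incorrect multiple-choice answers
# per model, scaled to 144 questions from the reported accuracies
# (62.3%, 53.5%, 38.2%) and rounded, so they are approximations.
table = [
    [90, 54],   # ChatGPT-4: ~62.3% of 144 correct
    [77, 67],   # Bing:      ~53.5% of 144 correct
    [55, 89],   # Bard:      ~38.2% of 144 correct
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
```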

Conclusions: LLMs demonstrate variable accuracy when answering clinical questions related to hepatology, though they show comparable efficacy when presented with questions in an open-ended versus multiple-choice (MCQ) format. Further research is required to investigate the optimal use of LLMs in clinical and educational contexts.
