Artificial intelligence in hepatology: a comparative analysis of ChatGPT-4, Bing, and Bard at answering clinical questions

Sama Anvari, Yung Lee, David Shiqiang Jin, Sarah Malone, Matthew Collins

Journal of the Canadian Association of Gastroenterology, 8(2), 58-62 (2025). DOI: 10.1093/jcag/gwae055
Abstract
Background and aims: The role of artificial intelligence (AI) in hepatology is rapidly expanding. However, the ability of AI chat models such as ChatGPT to accurately answer clinical questions remains unclear. This study aims to determine the ability of large language models (LLMs) to answer questions in hepatology and to compare the accuracy and quality of responses provided by different LLMs.
Methods: Hepatology questions from the Digestive Diseases Self-Education Platform were entered into three LLMs (OpenAI's ChatGPT-4, Microsoft's Bing, and Google's Bard) between September 7 and 13, 2023. Questions were posed with and without multiple-choice answers. Generated responses were assessed based on accuracy and number of correct answers. Statistical analysis was performed to determine the number of correct responses per LLM per category.
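The study entered questions manually into each chat interface; purely as an illustration of the scoring step described above, the following minimal Python sketch tallies per-model, per-category accuracy from recorded answers against an answer key. The data structures (`responses`, `answer_key`) and the function name are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: tally per-model, per-category accuracy from recorded answers.
# This is not the paper's actual procedure, which used manual entry into chat interfaces.
from collections import defaultdict

def accuracy_by_category(responses, answer_key):
    """responses: {model: {question_id: answer}}; answer_key: {question_id: (correct, category)}."""
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # model -> category -> [correct, total]
    for model, answers in responses.items():
        for qid, (correct, category) in answer_key.items():
            given = answers.get(qid)  # None if the model declined to answer
            tallies[model][category][1] += 1
            if given is not None and given == correct:
                tallies[model][category][0] += 1
    return {m: {c: right / total for c, (right, total) in cats.items()}
            for m, cats in tallies.items()}
```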
Results: A total of 144 questions were used to query the AI models. For multiple-choice questions, ChatGPT-4's accuracy was 62.3%, Bing's was 53.5%, and Bard's was 38.2% (P < .001). For open-ended questions, ChatGPT-4's accuracy was 44.4%, Bing's was 28.5%, and Bard's was 21.4% (P < .001). ChatGPT-4 and Bing attempted to answer 100% of the questions, whereas Bard was unable to answer 11.8% of them. All three LLMs provided a rationale in addition to an answer, as well as counselling where appropriate.
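The reported three-way comparison of accuracies is the kind of result a chi-square test of independence produces. The sketch below re-derives such a test using approximate correct/incorrect counts obtained by rounding the reported MCQ percentages against 144 questions; these counts are illustrative assumptions, not the paper's raw data.

```python
# Illustrative chi-square test on approximate counts rounded from the reported
# MCQ accuracies (62.3%, 53.5%, 38.2% of 144); not the paper's actual dataset.
from scipy.stats import chi2_contingency

correct = [90, 77, 55]                      # approx. correct answers: ChatGPT-4, Bing, Bard
incorrect = [144 - c for c in correct]      # remaining questions counted as incorrect

chi2, p, dof, expected = chi2_contingency([correct, incorrect])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")  # yields p < .001, consistent with the abstract
```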
Conclusions: LLMs demonstrate variable accuracy when answering clinical questions related to hepatology, though they show comparable efficacy when presented with questions in an open-ended versus multiple-choice (MCQ) format. Further research is required to investigate the optimal use of LLMs in clinical and educational contexts.