Evaluation of four chatbots in autoimmune liver disease: A comparative analysis

Impact factor 3.7; Zone 3 (Medicine); JCR Q2, Gastroenterology & Hepatology

Annals of hepatology Pub Date : 2024-08-13 DOI:10.1016/j.aohep.2024.101537

Jimmy Daza, Lucas Soares Bezerra, Laura Santamaría, Roberto Rueda-Esteban, Heike Bantel, Marcos Girala, Matthias Ebert, Florian Van Bömmel, Andreas Geier, Andres Gomez Aldana, Kevin Yau, Mario Alvares-da-Silva, Markus Peck-Radosavljevic, Ezequiel Ridruejo, Arndt Weinmann, Andreas Teufel

Annals of Hepatology, Volume 30, Issue 1, Article 101537. Full text: https://www.sciencedirect.com/science/article/pii/S1665268124003314

Abstract

Introduction and Objectives

Autoimmune liver diseases (AILDs) are rare and require precise evaluation, which is often challenging for medical providers. Chatbots are innovative solutions to assist healthcare professionals in clinical management. In our study, ten liver specialists systematically evaluated four chatbots to determine their utility as clinical decision support tools in the field of AILDs.

Materials and Methods

We constructed a 56-question questionnaire focusing on AILD evaluation, diagnosis, and management of Autoimmune Hepatitis (AIH), Primary Biliary Cholangitis (PBC), and Primary Sclerosing Cholangitis (PSC). Four chatbots (ChatGPT 3.5, Claude, Microsoft Copilot, and Google Bard) were presented with the questions in their free tiers in December 2023. Ten liver specialists critically evaluated the responses using a standardized 1-to-10 Likert scale. The analysis included mean scores, the number of highest-rated replies, and the identification of common shortcomings in chatbot performance.
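
The two quantitative summaries named above (mean scores and the number of highest-rated replies) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. The ratings are hypothetical toy values, the function names `summarize` and `best_rated_counts` are invented for illustration, and three choices are assumptions: the mean is pooled over all individual ratings, the SD is the sample standard deviation, and ties credit every tied chatbot.

```python
from statistics import mean, stdev

# Hypothetical toy data (NOT from the study):
# ratings[chatbot][question] = list of specialist scores on the 1-10 scale.
ratings = {
    "Chatbot A": {"Q1": [8, 7, 9], "Q2": [6, 7, 5]},
    "Chatbot B": {"Q1": [7, 7, 8], "Q2": [8, 9, 7]},
}


def summarize(ratings):
    """Return {chatbot: (mean, sample SD)} pooled over all individual scores."""
    summary = {}
    for bot, by_question in ratings.items():
        scores = [s for qs in by_question.values() for s in qs]
        summary[bot] = (mean(scores), stdev(scores))
    return summary


def best_rated_counts(ratings):
    """Count, per chatbot, the questions on which its mean score is the highest.

    Assumption: when several chatbots tie for the top mean, each is credited.
    """
    counts = {bot: 0 for bot in ratings}
    questions = next(iter(ratings.values())).keys()
    for q in questions:
        q_means = {bot: mean(ratings[bot][q]) for bot in ratings}
        top = max(q_means.values())
        for bot, m in q_means.items():
            if m == top:
                counts[bot] += 1
    return counts


if __name__ == "__main__":
    for bot, (m, sd) in summarize(ratings).items():
        print(f"{bot}: mean={m:.2f}, SD={sd:.2f}")
    print(best_rated_counts(ratings))
```

How ties and pooling are handled changes the counts, so a real analysis would need to follow the definitions given in the full paper.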

Results

Among the assessed chatbots, specialists rated Claude highest with a mean score of 7.37 (SD = 1.91), followed by ChatGPT (7.17, SD = 1.89), Microsoft Copilot (6.63, SD = 2.10), and Google Bard (6.52, SD = 2.27). Claude also excelled with 27 best-rated replies, outperforming ChatGPT (20), while Microsoft Copilot and Google Bard lagged with only 6 and 9, respectively. Common deficiencies included listing details over specific advice, limited dosing options, inaccuracies for pregnant patients, insufficient recent data, over-reliance on CT and MRI imaging, and inadequate discussion regarding off-label use and fibrates in PBC treatment. Notably, internet access for Microsoft Copilot and Google Bard did not enhance precision compared to pre-trained models.

Conclusions

Chatbots hold promise in AILD support, but our study underscores key areas for improvement. Refinement is needed in providing specific advice, accuracy, and focused up-to-date information. Addressing these shortcomings is essential for enhancing the utility of chatbots in AILD management, guiding future development, and ensuring their effectiveness as clinical decision-support tools.

Source journal

Annals of Hepatology (Medicine: Gastroenterology & Hepatology)

CiteScore: 7.90
Self-citation rate: 2.60%
Articles published: 183
Review time: 4-8 weeks

About the journal: Annals of Hepatology publishes original research on the biology and diseases of the liver in both humans and experimental models. Contributions may be submitted as regular articles. The journal also publishes concise reviews of both basic and clinical topics.