Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance.

IF 8, Zone 1 (Medicine), JCR Q1: Gastroenterology & Hepatology

American Journal of Gastroenterology. Pub Date: 2024-12-17. DOI: 10.14309/ajg.0000000000003255

Sheza Malik, Lewis J Frey, Jason Gutman, Asim Mushtaq, Fatima Warraich, Kamran Qureshi


Citations: 0

Abstract

Introduction: Recent advancements in artificial intelligence (AI), particularly through the deployment of large language models (LLMs), have profoundly impacted healthcare. This study assesses 5 LLMs (ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT) on the accuracy, clarity, and relevance of their responses to queries concerning acute liver failure (ALF). We subsequently compare these results with ChatGPT 4 enhanced with retrieval-augmented generation (RAG) technology.

Methods: Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore LLMs' ability to handle different clinical questions. Using the "New Chat" functionality, each query was processed individually across the models to reduce any bias. Additionally, we employed the RAG functionality of GPT-4, which integrates external sources as references to ground the results. All responses were evaluated on a Likert scale from 1 to 5 for accuracy, clarity, and relevance by 4 independent investigators to ensure impartiality.
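The scoring step described in the Methods can be sketched as a per-model, per-domain average of the raters' Likert scores. This is a minimal illustration only: the abstract does not state how the investigators aggregated their ratings, so the use of a simple mean, the record layout, and the names `mean_scores` and `model_a` are all assumptions, and the example ratings are made-up placeholders, not study data.

```python
from collections import defaultdict
from statistics import mean

DOMAINS = ("accuracy", "clarity", "relevance")


def mean_scores(ratings):
    """Average 1-5 Likert ratings per (model, domain).

    `ratings` is an iterable of tuples:
    (model, question_id, rater_id, domain, score).
    """
    buckets = defaultdict(list)
    for model, _question, _rater, domain, score in ratings:
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain}")
        if not 1 <= score <= 5:
            raise ValueError(f"score outside the 1-5 Likert range: {score}")
        buckets[(model, domain)].append(score)
    # Round to two decimals, the precision used in the reported results.
    return {key: round(mean(vals), 2) for key, vals in buckets.items()}


# Hypothetical placeholder ratings (NOT study data): one question, two raters.
example = [
    ("model_a", 1, "rater_1", "accuracy", 4),
    ("model_a", 1, "rater_2", "accuracy", 5),
    ("model_a", 1, "rater_1", "clarity", 3),
    ("model_a", 1, "rater_2", "clarity", 3),
]
summary = mean_scores(example)
```

In the study's setup, each model would contribute 16 questions, 4 raters, and 3 domains, so each (model, domain) bucket would hold 64 scores.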

Results: ChatGPT 4, augmented with RAG, demonstrated superior performance compared with others, consistently scoring the highest (4.70, 4.89, 4.78) across all 3 domains. ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.04 in clarity, 3.6 in relevance, and 3.65 in accuracy. Meanwhile, BARD and COPILOT exhibited lower performance levels; BARD recorded scores of 2.01 in accuracy and 3.03 in relevance, while COPILOT obtained 2.26 in accuracy and 3.12 in relevance.

Discussion: The study highlights the superior performance of ChatGPT 4 + RAG compared with other LLMs. By integrating RAG with LLMs, the system combines generative language skills with accurate, up-to-date information. This improves the clarity, relevance, and accuracy of responses, making them more effective in healthcare. However, AI models must continually evolve and align with medical practices for successful healthcare integration.
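The general idea of retrieval-augmented generation mentioned above can be sketched as two steps: pick the reference passages most related to the query, then place them in the prompt so the model answers from them. The abstract does not describe how the GPT-4 RAG functionality is implemented, so everything here is an assumption for illustration: the word-overlap ranking stands in for a real retriever, the function names are hypothetical, the passages are neutral placeholders rather than guideline text, and no model is actually called.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, passages, k=2):
    """Return the k passages sharing the most words with the query.

    Word overlap is a toy stand-in for a real retriever (an assumption).
    """
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & tokenize(p)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages, k=2):
    """Build a prompt that grounds the question in the retrieved passages."""
    selected = retrieve(query, passages, k)
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(selected, 1))
    return (
        "Answer using only the references below.\n"
        f"{context}\n\n"
        f"Question: {query}"
    )


# Placeholder passages (hypothetical, not real guideline content).
passages = [
    "placeholder passage about liver failure management",
    "placeholder passage about an unrelated topic",
    "placeholder passage about liver failure diagnosis criteria",
]
prompt = build_prompt("acute liver failure diagnosis criteria", passages, k=1)
```

The resulting `prompt` string would then be sent to a language model; grounding the answer in retrieved references is what the Discussion credits for the higher scores.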




Source journal

American Journal of Gastroenterology (Medicine: Gastroenterology & Hepatology)

CiteScore: 11.40

Self-citation rate: 5.10%

Articles published: 458

Review time: 12 months

Journal description: Published on behalf of the American College of Gastroenterology (ACG), The American Journal of Gastroenterology (AJG) stands as the foremost clinical journal in the fields of gastroenterology and hepatology. AJG offers practical and professional support to clinicians addressing the most prevalent gastroenterological disorders in patients.