Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance.

Impact Factor: 8.0 · CAS Tier 1 (Medicine) · JCR Q1: Gastroenterology & Hepatology
Sheza Malik, Lewis J Frey, Jason Gutman, Asim Mushtaq, Fatima Warraich, Kamran Qureshi
{"title":"Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance.","authors":"Sheza Malik, Lewis J Frey, Jason Gutman, Asim Mushtaq, Fatima Warraich, Kamran Qureshi","doi":"10.14309/ajg.0000000000003255","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Recent advancements in artificial intelligence (AI), particularly through the deployment of large language models (LLMs), have profoundly impacted healthcare. This study assesses 5 LLMs-ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT-on their response accuracy, clarity, and relevance to queries concerning acute liver failure (ALF). We subsequently compare these results with ChatGPT4 enhanced with retrieval augmented generation (RAG) technology.</p><p><strong>Methods: </strong>Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore LLMs' ability to handle different clinical questions. Using the \"New Chat\" functionality, each query was processed individually across the models to reduce any bias. Additionally, we employed the RAG functionality of GPT-4, which integrates external sources as references to ground the results. All responses were evaluated on a Likert scale from 1 to 5 for accuracy, clarity, and relevance by 4 independent investigators to ensure impartiality.</p><p><strong>Results: </strong>ChatGPT 4, augmented with RAG, demonstrated superior performance compared with others, consistently scoring the highest (4.70, 4.89, 4.78) across all 3 domains. ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.04 in clarity, 3.6 in relevance, and 3.65 in accuracy. Meanwhile, BARD and COPILOT exhibited lower performance levels; BARD recorded scores of 2.01 in accuracy and 3.03 in relevance, while COPILOT obtained 2.26 in accuracy and 3.12 in relevance.</p><p><strong>Discussion: </strong>The study highlights Chat GPT 4 +RAG's superior performance compared with other LLMs. By integrating RAG with LLMs, the system combines generative language skills with accurate, up-to-date information. This improves response clarity, relevance, and accuracy, making them more effective in healthcare. However, AI models must continually evolve and align with medical practices for successful healthcare integration.</p>","PeriodicalId":7608,"journal":{"name":"American Journal of Gastroenterology","volume":" ","pages":""},"PeriodicalIF":8.0000,"publicationDate":"2024-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Journal of Gastroenterology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.14309/ajg.0000000000003255","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"GASTROENTEROLOGY & HEPATOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Introduction: Recent advancements in artificial intelligence (AI), particularly through the deployment of large language models (LLMs), have profoundly impacted healthcare. This study assesses 5 LLMs (ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT) on the accuracy, clarity, and relevance of their responses to queries concerning acute liver failure (ALF). We then compare these results with those of ChatGPT 4 enhanced with retrieval-augmented generation (RAG).

Methods: Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore the LLMs' ability to handle different clinical questions. Using the "New Chat" functionality, each query was processed in its own fresh session in every model so that no conversational history could bias the responses. Additionally, we employed the RAG functionality of GPT-4, which grounds responses in external sources cited as references. All responses were rated for accuracy, clarity, and relevance on a Likert scale from 1 to 5 by 4 independent investigators to ensure impartiality.
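As a minimal sketch of this protocol (all helper names here are hypothetical, not from the paper), each of the 16 questions would be sent to each model in its own fresh session so that no conversational history carries over:

```python
# Hypothetical sketch of the per-query protocol: one fresh chat
# session per (model, question) pair, mirroring the "New Chat" step.

from dataclasses import dataclass

MODELS = ["ChatGPT 3.5", "ChatGPT 4", "BARD", "CLAUDE", "COPILOT",
          "ChatGPT 4 + RAG"]

@dataclass
class Response:
    model: str
    question_id: int
    text: str

def query_in_fresh_session(model: str, question: str) -> str:
    """Hypothetical stand-in for opening a new chat with `model`,
    sending `question`, and returning the answer text."""
    raise NotImplementedError

def collect_responses(questions: list[str]) -> list[Response]:
    responses = []
    for model in MODELS:
        for qid, question in enumerate(questions, start=1):
            # A new session per question prevents earlier answers
            # from leaking into later ones.
            answer = query_in_fresh_session(model, question)
            responses.append(Response(model, qid, answer))
    return responses
```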

Results: ChatGPT 4 augmented with RAG outperformed the other models, consistently scoring highest (4.70, 4.89, 4.78) across all 3 domains. ChatGPT 4 also performed well, scoring 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. CLAUDE achieved 3.65 in accuracy, 3.04 in clarity, and 3.6 in relevance. BARD and COPILOT performed lower: BARD scored 2.01 in accuracy and 3.03 in relevance, while COPILOT scored 2.26 in accuracy and 3.12 in relevance.
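For illustration only (the ratings data below is invented), the reported figures are consistent with simply averaging the investigators' Likert scores over the questions for each model and domain:

```python
# Toy sketch: average Likert ratings per (model, domain). With 4
# raters and 16 questions, each mean pools 64 scores; the toy list
# below is shorter and entirely made up.

from statistics import mean

def mean_scores(ratings: dict[tuple[str, str], list[int]]) -> dict[tuple[str, str], float]:
    # One mean per (model, domain) key, rounded to 2 decimals as reported.
    return {key: round(mean(scores), 2) for key, scores in ratings.items()}

toy = {("ChatGPT 4", "accuracy"): [4, 3, 4, 4, 3, 4, 4, 3]}
print(mean_scores(toy))  # {('ChatGPT 4', 'accuracy'): 3.62}
```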

Discussion: The study highlights the superior performance of ChatGPT 4 + RAG compared with the other LLMs. By integrating RAG with an LLM, the system combines generative language ability with accurate, up-to-date reference information. This improves the clarity, relevance, and accuracy of responses, making such models more effective in healthcare. However, AI models must continually evolve and stay aligned with medical practice for successful integration into healthcare.
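As a hedged sketch of the grounding idea (not the authors' actual pipeline; `retrieve` and `generate` are hypothetical helpers), a RAG setup fetches relevant reference passages and asks the model to answer from them:

```python
# Minimal RAG sketch: retrieve reference passages (e.g., from an
# index of guideline text) and prepend them to the prompt so the
# model answers from cited, up-to-date sources. Hypothetical helpers.

def retrieve(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever over a pre-built document index,
    returning the k passages most relevant to the query."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical call to the underlying LLM."""
    raise NotImplementedError

def rag_answer(question: str) -> str:
    passages = retrieve(question)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Using only the numbered excerpts below, answer the clinical "
        "question and cite excerpts by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```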

Source Journal

American Journal of Gastroenterology (Medicine: Gastroenterology & Hepatology)
CiteScore: 11.40 · Self-citation rate: 5.10% · Articles per year: 458 · Review time: 12 months

About the journal: Published on behalf of the American College of Gastroenterology (ACG), The American Journal of Gastroenterology (AJG) stands as the foremost clinical journal in the fields of gastroenterology and hepatology. AJG offers practical and professional support to clinicians addressing the most prevalent gastroenterological disorders in patients.