Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance.

Impact factor: 8.0 · CAS Tier 1 (Medicine) · JCR Q1, Gastroenterology & Hepatology
Sheza Malik, Lewis J Frey, Jason Gutman, Asim Mushtaq, Fatima Warraich, Kamran Qureshi
{"title":"评估人工智能对急性肝衰竭查询的响应:准确性、清晰度和相关性的比较分析。","authors":"Sheza Malik, Lewis J Frey, Jason Gutman, Asim Mushtaq, Fatima Warraich, Kamran Qureshi","doi":"10.14309/ajg.0000000000003255","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Recent advancements in Artificial Intelligence (AI), particularly through the deployment of Large Language Models (LLMs), have profoundly impacted healthcare. This study assesses five LLMs-ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT-on their response accuracy, clarity, and relevance to queries concerning acute liver failure (ALF). We subsequently compare these results with Chat GPT4 enhanced with Retrieval Augmented Generation (RAG) technology.</p><p><strong>Methods: </strong>Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore LLMs' ability to handle different clinical questions. Using the \"New Chat\" functionality, each query was processed individually across the models to reduce any bias. Additionally, we employed the RAG functionality of GPT-4, which integrates external sources as references to ground the results. All responses were evaluated on a Likert scale from 1 to 5 for accuracy, clarity, and relevance by four independent investigators to ensure impartiality.</p><p><strong>Result: </strong>ChatGPT 4, augmented with RAG, demonstrated superior performance compared to others, consistently scoring the highest (4.70, 4.89, 4.78) across all three domains. ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.04 in clarity, 3.6 in relevance, and 3.65 in accuracy. Meanwhile, BARD and COPILOT exhibited lower performance levels; BARD recorded scores of 2.01 in accuracy and 3.03 in relevance, while COPILOT obtained 2.26 in accuracy and 3.12 in relevance.</p><p><strong>Conclusion: </strong>The study highlights Chat GPT 4 +RAG's superior performance compared to other LLMs. By integrating RAG with LLMs, the system combines generative language skills with accurate, up-to-date information. This improves response clarity, relevance, and accuracy, making them more effective in healthcare. However, AI models must continually evolve and align with medical practices for successful healthcare integration.</p>","PeriodicalId":7608,"journal":{"name":"American Journal of Gastroenterology","volume":" ","pages":""},"PeriodicalIF":8.0000,"publicationDate":"2024-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance.\",\"authors\":\"Sheza Malik, Lewis J Frey, Jason Gutman, Asim Mushtaq, Fatima Warraich, Kamran Qureshi\",\"doi\":\"10.14309/ajg.0000000000003255\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Introduction: </strong>Recent advancements in Artificial Intelligence (AI), particularly through the deployment of Large Language Models (LLMs), have profoundly impacted healthcare. This study assesses five LLMs-ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT-on their response accuracy, clarity, and relevance to queries concerning acute liver failure (ALF). 
We subsequently compare these results with Chat GPT4 enhanced with Retrieval Augmented Generation (RAG) technology.</p><p><strong>Methods: </strong>Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore LLMs' ability to handle different clinical questions. Using the \\\"New Chat\\\" functionality, each query was processed individually across the models to reduce any bias. Additionally, we employed the RAG functionality of GPT-4, which integrates external sources as references to ground the results. All responses were evaluated on a Likert scale from 1 to 5 for accuracy, clarity, and relevance by four independent investigators to ensure impartiality.</p><p><strong>Result: </strong>ChatGPT 4, augmented with RAG, demonstrated superior performance compared to others, consistently scoring the highest (4.70, 4.89, 4.78) across all three domains. ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.04 in clarity, 3.6 in relevance, and 3.65 in accuracy. Meanwhile, BARD and COPILOT exhibited lower performance levels; BARD recorded scores of 2.01 in accuracy and 3.03 in relevance, while COPILOT obtained 2.26 in accuracy and 3.12 in relevance.</p><p><strong>Conclusion: </strong>The study highlights Chat GPT 4 +RAG's superior performance compared to other LLMs. By integrating RAG with LLMs, the system combines generative language skills with accurate, up-to-date information. This improves response clarity, relevance, and accuracy, making them more effective in healthcare. However, AI models must continually evolve and align with medical practices for successful healthcare integration.</p>\",\"PeriodicalId\":7608,\"journal\":{\"name\":\"American Journal of Gastroenterology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":8.0000,\"publicationDate\":\"2024-12-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"American Journal of Gastroenterology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.14309/ajg.0000000000003255\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"GASTROENTEROLOGY & HEPATOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Journal of Gastroenterology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.14309/ajg.0000000000003255","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"GASTROENTEROLOGY & HEPATOLOGY","Score":null,"Total":0}
Citations: 0

Abstract


Introduction: Recent advancements in Artificial Intelligence (AI), particularly the deployment of Large Language Models (LLMs), have profoundly impacted healthcare. This study assesses five LLMs (ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT) on the accuracy, clarity, and relevance of their responses to queries concerning acute liver failure (ALF). We then compare these results with ChatGPT 4 enhanced with Retrieval-Augmented Generation (RAG) technology.

Methods: Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore the LLMs' ability to handle different clinical questions. Using the "New Chat" functionality, each query was submitted in a fresh session for each model to reduce carry-over bias between prompts. Additionally, we employed the RAG functionality of GPT-4, which integrates external sources as references to ground the results. To ensure impartiality, four independent investigators rated every response for accuracy, clarity, and relevance on a 1-to-5 Likert scale.
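
The grounding step described in the Methods follows the standard RAG pattern: retrieve the reference passages most relevant to a query, then condition the model's answer on them. Below is a minimal, self-contained Python sketch of that pattern; the guideline snippets and the `llm_generate` stub are illustrative assumptions, not the study's actual corpus or its GPT-4 integration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical guideline snippets standing in for the external reference material.
passages = [
    "N-acetylcysteine is recommended for acetaminophen-induced acute liver failure.",
    "King's College criteria help identify ALF patients who may require transplantation.",
    "Intracranial pressure monitoring may be considered in high-grade hepatic encephalopathy.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the query (TF-IDF cosine similarity)."""
    vectorizer = TfidfVectorizer().fit(passages + [query])
    scores = cosine_similarity(vectorizer.transform([query]),
                               vectorizer.transform(passages))[0]
    return [passages[i] for i in scores.argsort()[::-1][:k]]

def llm_generate(prompt: str) -> str:
    # Placeholder: a real system would call the model API (e.g., GPT-4) here.
    return f"[model response to a {len(prompt)}-character grounded prompt]"

def answer(query: str) -> str:
    """Prepend retrieved passages so the model's answer is grounded in them."""
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nAnswer using only the context above.\nQuestion: {query}"
    return llm_generate(prompt)

print(answer("What is the treatment for acetaminophen-induced acute liver failure?"))
```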

Results: ChatGPT 4 augmented with RAG outperformed the other models, consistently scoring the highest (4.70, 4.89, and 4.78) across all three domains. ChatGPT 4 alone performed well, scoring 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. CLAUDE scored 3.65 in accuracy, 3.04 in clarity, and 3.60 in relevance. BARD and COPILOT performed at lower levels: BARD scored 2.01 in accuracy and 3.03 in relevance, while COPILOT scored 2.26 in accuracy and 3.12 in relevance.
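
For readers reproducing this kind of analysis, per-domain scores of the form reported above can be computed as plain averages over raters and questions. The short sketch below uses randomly generated ratings, not the study's data, purely to illustrate the aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical Likert ratings: 4 raters x 16 questions x 3 domains, values 1-5.
ratings = rng.integers(1, 6, size=(4, 16, 3))

# Averaging over raters and questions yields one mean score per domain,
# matching the form of the figures reported in the Results.
for domain, mean in zip(["accuracy", "clarity", "relevance"],
                        ratings.mean(axis=(0, 1))):
    print(f"{domain}: {mean:.2f}")
```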

Conclusion: The study highlights the superior performance of ChatGPT 4 + RAG compared with the other LLMs. By integrating RAG with an LLM, the system combines generative language ability with accurate, up-to-date information, improving the clarity, relevance, and accuracy of responses and making such systems more effective in healthcare. However, AI models must continually evolve and stay aligned with medical practice for successful healthcare integration.

Journal
American Journal of Gastroenterology (Medicine: Gastroenterology & Hepatology)
CiteScore: 11.40
Self-citation rate: 5.10%
Annual articles: 458
Review time: 12 months
About the journal: Published on behalf of the American College of Gastroenterology (ACG), The American Journal of Gastroenterology (AJG) stands as the foremost clinical journal in the fields of gastroenterology and hepatology. AJG offers practical and professional support to clinicians addressing the most prevalent gastroenterological disorders in patients.