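Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Noncustom Models for Evidence-Based Medicine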

IF 4.4 | Zone 1 (Medicine) | Q1 ORTHOPEDICS

Arthroscopy-The Journal of Arthroscopic and Related Surgery Pub Date : 2024-11-07 DOI:10.1016/j.arthro.2024.10.042

Joshua J. Woo B.S. , Andrew J. Yang B.S. , Reena J. Olsen M.S. , Sayyida S. Hasan B.S. , Danyal H. Nawabi M.D. , Benedict U. Nwachukwu M.D., M.B.A. , Riley J. Williams III M.D. , Prem N. Ramkumar M.D., M.B.A.

Volume 41, Issue 3, Pages 565-573.e6. Publication type: Journal Article.

Full text: https://www.sciencedirect.com/science/article/pii/S0749806324008831

Citations: 0

Abstract


Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Noncustom Models for Evidence-Based Medicine

Purpose

To show the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information using an anterior cruciate ligament (ACL) injury case.

Methods

A set of 100 questions and answers based on the 2022 AAOS ACL guidelines was curated. Closed-source models (OpenAI's GPT-4/GPT-3.5 and Anthropic's Claude 3) and open-source models (Llama 3 8b/70b and Mistral 8×7b) were asked the questions in base form and again with the AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with artificial intelligence (AI) agents and reevaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Metric for Evaluation of Translation with Explicit Ordering (METEOR) scores were calculated to assess semantic similarity in the responses.
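The two technical steps named in the Methods, retrieval of guideline text before prompting and overlap-based scoring of responses, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the bag-of-words cosine retriever, the prompt wording, the function names, and the placeholder passages are all assumptions made for illustration, and `rouge1_recall` is a simplified unigram-recall variant rather than a full ROUGE or METEOR implementation.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the question (toy retriever)."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Prepend retrieved passages so the model is asked to answer from them."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


def rouge1_recall(reference: str, candidate: str) -> float:
    """Share of reference unigrams also present in the candidate (clipped counts)."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    overlap = sum(min(n, cand[t]) for t, n in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical placeholder passages, NOT text from the AAOS guidelines.
    passages = [
        "placeholder text about graft selection",
        "placeholder text about return to sport timing",
    ]
    question = "What graft selection is advised?"
    print(build_prompt(question, retrieve(question, passages)))
    print(rouge1_recall("the cat sat", "the cat ran"))
```

A production RAG system would typically replace the word-count retriever with dense embeddings and a vector index; the sketch only shows where retrieved guideline text enters the prompt and how a reference answer can be compared with a model response.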

Results

All noncustom LLMs started below 60% accuracy. Applying RAG improved the accuracy of every model, by an average of 39.7%. The highest performing model with RAG alone was Meta's open-source Llama 3 70b (94%). The highest performing model with RAG and AI agents was OpenAI's GPT-4 (95%).

Conclusions

RAG improved accuracy by an average of 39.7%, with the highest accuracy rate of 94% achieved by Meta's Llama 3 70b. Incorporating AI agents into a previously RAG-augmented LLM improved the accuracy rate of GPT-4 to 95%. Thus, agentic and RAG-augmented LLMs can be accurate liaisons of information, supporting our hypothesis.

Clinical Relevance

Despite the literature on the use of LLMs in medicine, there has been considerable and appropriate skepticism given their variably accurate response rates. This study establishes the groundwork to identify whether custom modifications to LLMs using RAG and agentic augmentation can better deliver accurate information in orthopaedic care. With this knowledge, the medical information that is commonly sought from popular LLMs, such as ChatGPT, can be standardized and made more relevant, better supporting shared decision making between surgeon and patient.


Source journal

Arthroscopy-The Journal of Arthroscopic and Related Surgery (Medicine: Surgery)

CiteScore: 9.30

Self-citation rate: 17.00%

Articles published: 555

Review time: 58 days

Journal description: Nowhere is minimally invasive surgery explained better than in Arthroscopy, the leading peer-reviewed journal in the field. Every issue enables you to put into perspective the usefulness of the various emerging arthroscopic techniques. The advantages and disadvantages of these methods, along with their applications in various situations, are discussed in relation to their efficiency, efficacy, and cost benefit. As a special incentive, paid subscribers also receive access to the journal's expanded website.