Joshua J Woo, Andrew J Yang, Reena J Olsen, Sayyida S Hasan, Danyal H Nawabi, Benedict U Nwachukwu, Riley J Williams, Prem N Ramkumar
Arthroscopy: The Journal of Arthroscopic and Related Surgery. Published 2024-11-07. DOI: 10.1016/j.arthro.2024.10.042 (https://doi.org/10.1016/j.arthro.2024.10.042)
Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Non-Custom Models for Evidence-Based Medicine.
Purpose: To demonstrate the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information, using anterior cruciate ligament (ACL) injury as a test case.
Methods: A set of 100 questions and answers based on the 2022 AAOS ACL guidelines was curated. Closed-source models (OpenAI GPT-4/GPT-3.5 and Anthropic's Claude 3) and open-source models (Llama 3 8b/70b and Mistral 8x7b) were asked the questions in base form and again with the AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with Artificial Intelligence (AI) Agents and re-evaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. ROUGE and METEOR scores were calculated to assess semantic similarity in the responses.
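The RAG step and the overlap scoring described in the Methods can be illustrated with a minimal Python sketch. It is not the study's pipeline: the bag-of-words cosine retriever stands in for whatever embedding-based retrieval the authors used, the passages are placeholders rather than AAOS guideline text, the function names (`retrieve`, `build_prompt`, `rouge1_f1`) are hypothetical, and the score is a simple ROUGE-1-style unigram F1 rather than the full ROUGE or METEOR metrics.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question (toy retriever)."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context):
    """Prepend the retrieved passages to the question before it goes to an LLM."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


def rouge1_f1(candidate, reference):
    """ROUGE-1-style unigram-overlap F1 between a candidate and a reference."""
    c = Counter(tokenize(candidate))
    r = Counter(tokenize(reference))
    overlap = sum((c & r).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


# Placeholder passages for illustration only, NOT actual guideline text.
PASSAGES = [
    "Placeholder passage about graft selection for reconstruction.",
    "Placeholder passage about timing of surgery after injury.",
]

if __name__ == "__main__":
    question = "What does the guideline say about timing of surgery?"
    context = retrieve(question, PASSAGES, k=1)
    print(build_prompt(question, context))
    print(rouge1_f1("timing of surgery after injury", context[0]))
```

In a real system the assembled prompt would be sent to the model under evaluation, and the model's answer, rather than a hand-written string, would be scored against the curated reference answer.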
Results: All non-custom LLMs started below 60% accuracy. Applying RAG improved the accuracy of every model, by an average of 39.7%. The highest-performing model with RAG alone was Meta's open-source Llama 3 70b (94%). The highest-performing model with RAG and AI Agents was OpenAI's GPT-4 (95%).
Conclusion: RAG improved accuracy by an average of 39.7%, with the highest accuracy rate, 94%, achieved by Meta's Llama 3 70b. Incorporating AI Agents into a previously RAG-augmented LLM improved the accuracy rate of GPT-4 to 95%. Thus, Agentic and RAG-augmented LLMs can be accurate liaisons of information, supporting our hypothesis.
Clinical relevance: Despite the literature on the use of LLMs in medicine, there has been considerable and appropriate skepticism given their variably accurate responses. This study lays the groundwork for determining whether custom modifications to LLMs using RAG and Agentic augmentation can better deliver accurate information in orthopaedic care. With this knowledge, the online medical information commonly sought from popular LLMs, such as ChatGPT, can be standardized and made relevant, better supporting shared decision making between surgeon and patient.
About the journal:
Nowhere is minimally invasive surgery explained better than in Arthroscopy, the leading peer-reviewed journal in the field. Every issue enables you to put into perspective the usefulness of the various emerging arthroscopic techniques. The advantages and disadvantages of these methods, along with their applications in various situations, are discussed in relation to their efficiency, efficacy and cost benefit. As a special incentive, paid subscribers also receive access to the journal's expanded website.