Advancing Question-Answering in Ophthalmology With Retrieval-Augmented Generation: Benchmarking Open-Source and Proprietary Large Language Models.

Impact factor 2.6 | Zone 3, Medicine | JCR Q2, Ophthalmology

Translational Vision Science & Technology Pub Date : 2025-09-02 DOI:10.1167/tvst.14.9.18

Quang Nguyen, Duy-Anh Nguyen, Khang Dang, Siyin Liu, Sophia Y Wang, William A Woof, Peter B M Thomas, Praveen J Patel, Konstantinos Balaskas, Johan H Thygesen, Honghan Wu, Nikolas Pontikos

Volume 14, Issue 9, Article 18 | Journal Article | Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12439504/pdf/

Citations: 0

Abstract

Purpose: To evaluate Retrieval-Augmented Generation (RAG), which combines information retrieval with text generation, and to use it to benchmark open-source and proprietary generative large language models (LLMs) on question-answering in ophthalmology.

Methods: Our dataset comprised 260 multiple-choice questions sourced from two question-answer banks designed to assess ophthalmic knowledge: the American Academy of Ophthalmology's (AAO) Basic and Clinical Science Course (BCSC) Self-Assessment program and OphthoQuestions. Our RAG pipeline retrieved documents from the BCSC companion textbook using ChromaDB, then reranked them with Cohere to refine the context provided to the LLMs. Generative Pretrained Transformer (GPT)-4-turbo and three open-source models (Llama-3-70B, Gemma-2-27B, and Mixtral-8x7B) were benchmarked using zero-shot prompting, zero-shot prompting with Chain-of-Thought (zero-shot-CoT), and RAG. Model performance was evaluated using accuracy on the two datasets. Quantization was applied to improve the efficiency of the open-source models, and the effect of quantization level was also measured.
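The retrieve-then-rerank flow described in the Methods can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the bag-of-words cosine score stands in for ChromaDB's vector search, the query-coverage score stands in for Cohere's reranker, and every function name, the prompt wording, and the toy passages are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k):
    """Stage 1 (stand-in for a vector store query): top-k passages by cosine score."""
    q = Counter(tokenize(query))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]


def rerank(query, candidates, top_n):
    """Stage 2 (stand-in for a reranking model): reorder by fraction of query terms covered."""
    q = set(tokenize(query))

    def coverage(passage):
        return len(q & set(tokenize(passage))) / len(q) if q else 0.0

    return sorted(candidates, key=coverage, reverse=True)[:top_n]


def build_prompt(question, options, context):
    """Assemble a multiple-choice prompt with the retrieved context placed first."""
    lines = ["Context:"]
    lines += [f"- {passage}" for passage in context]
    lines += ["", f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in options.items()]
    lines += ["", "Answer with a single letter."]
    return "\n".join(lines)


# Toy passages standing in for textbook chunks (hypothetical, not from the BCSC).
PASSAGES = [
    "The corneal endothelium pumps fluid out of the stroma to keep the cornea clear.",
    "The retina contains rod and cone photoreceptors.",
    "The lens changes shape during accommodation.",
]

QUESTION = "Which corneal layer pumps fluid out of the stroma?"
OPTIONS = {"A": "Epithelium", "B": "Endothelium", "C": "Bowman layer", "D": "Stroma"}

candidates = retrieve(QUESTION, PASSAGES, k=2)
context = rerank(QUESTION, candidates, top_n=1)
prompt = build_prompt(QUESTION, OPTIONS, context)
```

The resulting `prompt` string is what would be sent to each LLM in the RAG condition; the zero-shot condition corresponds to calling `build_prompt` with an empty `context` list.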

Results: Using RAG, GPT-4-turbo's accuracy increased by 11.54% on BCSC and by 10.96% on OphthoQuestions. Importantly, the RAG pipeline greatly enhanced the overall performance of Llama-3 by 23.85%, Gemma-2 by 17.11%, and Mixtral-8x7B by 22.11%. Overall, zero-shot-CoT did not significantly improve the models' performance. Quantization to 4 bits was shown to be as effective as quantization to 8 bits while requiring half the resources.
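The two quantities behind these results, accuracy on a multiple-choice set and the resource cost of a quantization level, can be made concrete with a short sketch. The function names are hypothetical, the parameter counts are the nominal sizes in the model names, and the memory estimate covers weights only (it ignores activations, the key-value cache, and any per-block quantization overhead), so it is a rough illustration of why 4-bit storage needs about half the memory of 8-bit storage rather than a measurement from the study.

```python
def accuracy(predictions, answers):
    """Percentage of multiple-choice predictions that match the answer key."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts taken from the model names (assumption).
NOMINAL_PARAMS = {"Llama-3-70B": 70e9, "Gemma-2-27B": 27e9}

# Weight memory per model at 8-bit and 4-bit precision.
footprints = {
    name: {bits: weight_memory_gb(n, bits) for bits in (8, 4)}
    for name, n in NOMINAL_PARAMS.items()
}

# Toy example: three of four answers correct.
toy_accuracy = accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"])
```

Under these assumptions, a 70-billion-parameter model needs roughly 70 GB for its weights at 8 bits and roughly 35 GB at 4 bits.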

Conclusions: Our work demonstrates that integrating RAG significantly enhances LLM accuracy, especially for smaller LLMs.

Translational Relevance: With our RAG pipeline, smaller privacy-preserving open-source LLMs can be run in sensitive and resource-constrained environments, such as within hospitals, offering a viable alternative to cloud-based LLMs like GPT-4-turbo.




Source journal: Translational Vision Science & Technology (Engineering-Biomedical Engineering)

CiteScore: 5.70

Self-citation rate: 3.30%

Articles published: 346

Review time: 25 weeks

About the journal: Translational Vision Science & Technology (TVST) is an online, open access, peer-reviewed journal emphasizing multidisciplinary research that bridges the gap between basic research and clinical care. It is an official journal of the Association for Research in Vision and Ophthalmology (ARVO), an international organization whose purpose is to advance research worldwide into understanding the visual system and preventing, treating, and curing its disorders. A highly qualified and diverse group of Associate Editors and Editorial Board Members is led by Editor-in-Chief Marco Zarbin, MD, PhD, FARVO. The journal covers a broad spectrum of work, including but not limited to: applications of stem cell technology for regenerative medicine; development of new animal models of human diseases; tissue bioengineering; chemical engineering to improve virus-based gene delivery; nanotechnology for drug delivery; design and synthesis of artificial extracellular matrices; development of a true microsurgical operating environment; refining data analysis algorithms to improve in vivo imaging technology; results of Phase 1 clinical trials; and reverse translational ("bedside to bench") research. TVST seeks manuscripts from scientists and clinicians with diverse backgrounds, ranging from basic chemistry to ophthalmic surgery, that will advance or change the way we understand and/or treat vision-threatening diseases. TVST encourages the use of color, multimedia, hyperlinks, program code, and other digital enhancements.