GlaucoRAG: A Retrieval-Augmented Large Language Model for Expert-Level Glaucoma Assessment.

Mohammad Aminan, S Solomon Darnell, Mohammad Delsoz, Amin Nabavi, Claire Wright, Brian Jerkins, Siamak Yousefi
{"title":"GlaucoRAG: A Retrieval-Augmented Large Language Model for Expert-Level Glaucoma Assessment.","authors":"Mohammad Aminan, S Solomon Darnell, Mohammad Delsoz, Amin Nabavi, Claire Wright, Brian Jerkins, Siamak Yousefi","doi":"10.1101/2025.07.03.25330805","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Purpose: Accurate glaucoma assessment is challenging because of the complexity and chronic nature of the disease; therefore, there is a critical need for models that provide evidence-based, accurate assessment. The purpose of this study was to evaluate the capabilities of a glaucoma specialized Retrieval-Augmented Generation (RAG) framework (GlaucoRAG) that leverages a large language model (LLM) for diagnosing glaucoma and answering to glaucoma specific questions.</p><p><strong>Design: </strong>Evaluation of diagnostic capabilities and knowledge of emerging technologies in glaucoma assessment.</p><p><strong>Participants: </strong>Detailed case reports from 11 patients and 250 multiple choice questions from the Basic and Clinical Science Course (BCSC) Self-Assessment were used to test the LLM based GlaucoRAG. No human participants were involved.</p><p><strong>Methods: </strong>We developed GlaucoRAG, a RAG framework leveraging GPT-4.5-PREVIEW integrated with the R2R platform for automated question answering in glaucoma. We created a glaucoma knowledge base comprising more than 1,800 peer-reviewed glaucoma articles, 15 guidelines and three glaucoma textbooks. The diagnostic performance was tested on case reports and multiple-choice questions. Model outputs were compared with the independent answers of three glaucoma specialists, DeepSeek-R1, and GPT-4.5-PREVIEW (without RAG). 
Quantitative performance was further assessed with the RAG Assessment (RAGAS) framework, reporting faithfulness, context precision, context recall, and answer relevancy.</p><p><strong>Main outcome measures: </strong>The primary outcome measure was GlaucoRAG's diagnostic accuracy on patient case reports and percentage of correct responses to the BCSC Self-Assessment glaucoma items, compared with the performance of glaucoma specialists and two benchmark LLMs. Secondary outcomes included RAGAS sub scores.</p><p><strong>Results: </strong>GlaucoRAG achieved an accuracy of 81.8% on glaucoma case reports, compared with 72.7% for GPT-4.5-PREVIEW and 63.7% for DeepSeek-R1. On glaucoma BCSC Self-Assessment questions, GlaucoRAG achieved 91.2% accuracy (228 / 250), whereas GPT-4.5-PREVIEW and DeepSeek-R1 attained 84.4% (211 / 250) and 76.0% (190 / 250), respectively. The RAGAS evaluation returned an answer relevancy of 91%, with 80% context recall, 70% faithfulness, and 59% context precision.</p><p><strong>Conclusions: </strong>The glaucoma-specialized LLM, GlaucoRAG, showed encouraging performance in glaucoma assessment and may complement glaucoma research and clinical practice as well as question answering with glaucoma patients.</p>","PeriodicalId":94281,"journal":{"name":"medRxiv : the preprint server for health sciences","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12265780/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv : the preprint server for health sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2025.07.03.25330805","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose: Accurate glaucoma assessment is challenging because of the complexity and chronic nature of the disease; therefore, there is a critical need for models that provide evidence-based, accurate assessment. The purpose of this study was to evaluate the capabilities of a glaucoma-specialized Retrieval-Augmented Generation (RAG) framework (GlaucoRAG) that leverages a large language model (LLM) to diagnose glaucoma and answer glaucoma-specific questions.

Design: Evaluation of diagnostic capabilities and knowledge of emerging technologies in glaucoma assessment.

Participants: Detailed case reports from 11 patients and 250 multiple-choice questions from the Basic and Clinical Science Course (BCSC) Self-Assessment were used to test the LLM-based GlaucoRAG. No human participants were involved.

Methods: We developed GlaucoRAG, a RAG framework leveraging GPT-4.5-PREVIEW integrated with the R2R platform for automated question answering in glaucoma. We created a glaucoma knowledge base comprising more than 1,800 peer-reviewed glaucoma articles, 15 guidelines, and three glaucoma textbooks. Diagnostic performance was tested on case reports and multiple-choice questions. Model outputs were compared with the independent answers of three glaucoma specialists, DeepSeek-R1, and GPT-4.5-PREVIEW (without RAG). Quantitative performance was further assessed with the Retrieval-Augmented Generation Assessment (RAGAS) framework, reporting faithfulness, context precision, context recall, and answer relevancy.
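The core RAG loop described here can be illustrated with a minimal retrieve-then-prompt sketch. This is not the GlaucoRAG/R2R implementation: the term-overlap retriever, the example documents, and the prompt wording below are all illustrative assumptions (production systems use embedding-based retrieval and an LLM to generate the answer).

```python
def tokenize(text):
    # Lowercase word tokens; a real system would use dense embeddings instead.
    return [w.strip(".,;:?()").lower() for w in text.split()]

def retrieve(query, documents, k=2):
    """Rank documents by simple term overlap with the query and return the top k."""
    q_terms = set(tokenize(query))
    scored = [(sum(1 for t in tokenize(d) if t in q_terms), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def build_prompt(query, documents, k=2):
    """Assemble retrieved passages and the question into a grounded prompt for the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical three-document knowledge base standing in for the 1,800+ articles.
knowledge_base = [
    "Primary open-angle glaucoma is associated with elevated intraocular pressure.",
    "Gonioscopy distinguishes open-angle from angle-closure glaucoma.",
    "Cataract surgery outcomes depend on lens opacity grading.",
]
prompt = build_prompt(
    "How is angle-closure glaucoma distinguished from open-angle?", knowledge_base
)
print(prompt)
```

The point of the sketch is the grounding step: only the top-ranked passages reach the model, so the answer can be checked against an explicit context rather than the model's parametric memory.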

Main outcome measures: The primary outcome measure was GlaucoRAG's diagnostic accuracy on patient case reports and the percentage of correct responses to the BCSC Self-Assessment glaucoma items, compared with the performance of glaucoma specialists and two benchmark LLMs. Secondary outcomes included RAGAS subscores.
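The RAGAS subscores are ratios of supported claims. A toy version can make the ratios concrete; note that RAGAS itself uses an LLM judge to decide whether a claim is supported, so the substring matching below is only a stand-in, and the example claims and context are invented for illustration.

```python
def faithfulness(answer_claims, context):
    """Toy faithfulness: fraction of claims in the generated answer that are
    supported by the retrieved context (here, via naive substring matching)."""
    if not answer_claims:
        return 0.0
    supported = sum(1 for claim in answer_claims if claim in context)
    return supported / len(answer_claims)

def context_recall(ground_truth_claims, context):
    """Toy context recall: fraction of ground-truth claims that the retrieved
    context actually contains."""
    if not ground_truth_claims:
        return 0.0
    found = sum(1 for claim in ground_truth_claims if claim in context)
    return found / len(ground_truth_claims)

context = "IOP was 28 mmHg. Cup-to-disc ratio was 0.8."
print(faithfulness(["IOP was 28 mmHg", "Visual field was normal"], context))  # 0.5
print(context_recall(["Cup-to-disc ratio was 0.8"], context))  # 1.0
```

Faithfulness penalizes the answer for claims absent from the context, while context recall penalizes the retriever for missing ground-truth facts; the two can diverge, which is why RAGAS reports them separately.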

Results: GlaucoRAG achieved an accuracy of 81.8% on glaucoma case reports, compared with 72.7% for GPT-4.5-PREVIEW and 63.7% for DeepSeek-R1. On glaucoma BCSC Self-Assessment questions, GlaucoRAG achieved 91.2% accuracy (228 / 250), whereas GPT-4.5-PREVIEW and DeepSeek-R1 attained 84.4% (211 / 250) and 76.0% (190 / 250), respectively. The RAGAS evaluation returned an answer relevancy of 91%, with 80% context recall, 70% faithfulness, and 59% context precision.

Conclusions: The glaucoma-specialized LLM, GlaucoRAG, showed encouraging performance in glaucoma assessment and may complement glaucoma research and clinical practice, as well as question answering for glaucoma patients.
