GlaucoRAG: A Retrieval-Augmented Large Language Model for Expert-Level Glaucoma Assessment.

Mohammad Aminan, S Solomon Darnell, Mohammad Delsoz, Amin Nabavi, Claire Wright, Brian Jerkins, Siamak Yousefi
medRxiv preprint, posted July 7, 2025. doi: 10.1101/2025.07.03.25330805
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12265780/pdf/

Abstract

Purpose: Accurate glaucoma assessment is challenging because of the complexity and chronic nature of the disease; therefore, there is a critical need for models that provide evidence-based, accurate assessment. The purpose of this study was to evaluate the capabilities of a glaucoma-specialized Retrieval-Augmented Generation (RAG) framework (GlaucoRAG) that leverages a large language model (LLM) to diagnose glaucoma and answer glaucoma-specific questions.

Design: Evaluation of diagnostic capabilities and knowledge of emerging technologies in glaucoma assessment.

Participants: Detailed case reports from 11 patients and 250 multiple-choice questions from the Basic and Clinical Science Course (BCSC) Self-Assessment were used to test the LLM-based GlaucoRAG. No human participants were involved.

Methods: We developed GlaucoRAG, a RAG framework that leverages GPT-4.5-PREVIEW integrated with the R2R platform for automated question answering in glaucoma. We created a glaucoma knowledge base comprising more than 1,800 peer-reviewed glaucoma articles, 15 guidelines, and three glaucoma textbooks. Diagnostic performance was tested on case reports and multiple-choice questions. Model outputs were compared with the independent answers of three glaucoma specialists, DeepSeek-R1, and GPT-4.5-PREVIEW (without RAG). Quantitative performance was further assessed with the RAG Assessment (RAGAS) framework, reporting faithfulness, context precision, context recall, and answer relevancy.
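The paper does not include its implementation, but the retrieve-then-generate workflow it describes can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' R2R-based system: it embeds pre-chunked knowledge-base passages, retrieves the most similar passages for a question, and asks a chat model to answer from that context. The embedding model, chunking, prompt, and the `passages` variable are hypothetical choices; only the chat-model name follows the paper's GPT-4.5-PREVIEW.

```python
"""Minimal RAG sketch (illustrative only; the study uses the R2R platform,
whose API is not shown here). Assumes OPENAI_API_KEY is set and that the
knowledge base has already been chunked into the `passages` list."""
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Embed passages or questions with an OpenAI embedding model (assumed choice).
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(question: str, passages: list[str], passage_vecs: np.ndarray, k: int = 5) -> list[str]:
    # Cosine-similarity top-k retrieval over the pre-embedded knowledge base.
    q = embed([question])[0]
    sims = passage_vecs @ q / (np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(q))
    return [passages[i] for i in np.argsort(sims)[::-1][:k]]

def answer(question: str, passages: list[str], passage_vecs: np.ndarray) -> str:
    # Ground the LLM's answer in the retrieved glaucoma literature.
    context = "\n\n".join(retrieve(question, passages, passage_vecs))
    resp = client.chat.completions.create(
        model="gpt-4.5-preview",  # model named in the paper; availability may vary
        messages=[
            {"role": "system", "content": "You are a glaucoma specialist. Answer using "
             "only the provided context, and say so if the context is insufficient."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

In the study itself, indexing and retrieval are handled by the R2R platform rather than hand-rolled cosine search; the sketch only conveys the overall retrieval-augmented flow that is later scored with RAGAS.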

Main outcome measures: The primary outcome measures were GlaucoRAG's diagnostic accuracy on the patient case reports and the percentage of correct responses to the BCSC Self-Assessment glaucoma items, compared with the performance of glaucoma specialists and two benchmark LLMs. Secondary outcomes included the RAGAS subscores.

Results: GlaucoRAG achieved an accuracy of 81.8% on glaucoma case reports, compared with 72.7% for GPT-4.5-PREVIEW and 63.7% for DeepSeek-R1. On glaucoma BCSC Self-Assessment questions, GlaucoRAG achieved 91.2% accuracy (228/250), whereas GPT-4.5-PREVIEW and DeepSeek-R1 attained 84.4% (211/250) and 76.0% (190/250), respectively. The RAGAS evaluation returned an answer relevancy of 91%, with 80% context recall, 70% faithfulness, and 59% context precision.
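The RAGAS figures above (faithfulness, context precision, context recall, answer relevancy) are the standard metrics of the open-source RAGAS tooling. A minimal evaluation run might look like the sketch below; it assumes the `ragas` Python package with its 0.1-style `evaluate` API, an OpenAI key for the judge model, and a single toy record standing in for the study's real question, answer, retrieved contexts, and reference answer (none of these values come from the paper).

```python
"""Sketch of a RAGAS evaluation run (assumes the open-source `ragas` package;
column names and metric imports follow the 0.1-style API and differ in newer
releases). Requires an OPENAI_API_KEY for the judge LLM."""
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One toy record; in practice, each evaluated question would be paired with the
# system's generated answer, the passages it retrieved, and a reference answer.
records = {
    "question": ["What drug class is commonly used first-line for open-angle glaucoma?"],
    "answer": ["Topical prostaglandin analogs are commonly used as first-line therapy."],
    "contexts": [[
        "Prostaglandin analogs are widely recommended as initial medical therapy "
        "for open-angle glaucoma because of their efficacy and once-daily dosing."
    ]],
    "ground_truth": ["Topical prostaglandin analogs."],
}

result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores, e.g. faithfulness, answer_relevancy, ...
```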

Conclusions: The glaucoma-specialized LLM, GlaucoRAG, showed encouraging performance in glaucoma assessment and may complement glaucoma research and clinical practice, as well as question answering for glaucoma patients.
