Development and Evaluation of a Retrieval-Augmented Large Language Model Framework for Ophthalmology.

IF 7.8; Zone 1 (Medicine); JCR Q1 (Ophthalmology)

JAMA ophthalmology Pub Date : 2024-09-01 DOI:10.1001/jamaophthalmol.2024.2513

Ming-Jie Luo, Jianyu Pang, Shaowei Bi, Yunxi Lai, Jiaman Zhao, Yuanrui Shang, Tingxin Cui, Yahan Yang, Zhenzhe Lin, Lanqin Zhao, Xiaohang Wu, Duoru Lin, Jingjing Chen, Haotian Lin

Pages: 798-805. Publication type: Journal Article. PMC full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11258636/pdf/

Citations: 0

Abstract

Importance: Although augmenting large language models (LLMs) with knowledge bases may improve medical domain-specific performance, practical methods are needed for local implementation of LLMs that address privacy concerns and enhance accessibility for health care professionals.

Objective: To develop an accurate, cost-effective local implementation of an LLM to mitigate privacy concerns and support its practical deployment in health care settings.

Design, setting, and participants: ChatZOC (Sun Yat-Sen University Zhongshan Ophthalmology Center), a retrieval-augmented LLM framework, was developed by enhancing a baseline LLM with a comprehensive ophthalmic dataset and evaluation framework (CODE), which includes over 30 000 pieces of ophthalmic knowledge. This LLM was benchmarked against 10 representative LLMs, including GPT-4 and GPT-3.5 Turbo (OpenAI), across 300 clinical questions in ophthalmology. The evaluation, involving a panel of medical experts and biomedical researchers, focused on accuracy, utility, and safety. A double-masked approach was used to minimize assessment bias across all models. The study used a comprehensive knowledge base derived from ophthalmic clinical practice, without directly involving clinical patients.
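
The abstract does not say how ChatZOC retrieves knowledge or assembles prompts, so the following is only a minimal sketch of the general retrieval-augmentation idea, not the authors' implementation. The three-entry knowledge base, the bag-of-words cosine ranking, the prompt wording, and every function name (`tokenize`, `cosine`, `retrieve`, `build_prompt`) are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy knowledge base (assumption); the CODE dataset described
# in the abstract holds over 30 000 pieces of ophthalmic knowledge.
KNOWLEDGE_BASE = [
    "A cataract is a clouding of the lens of the eye.",
    "Glaucoma involves damage to the optic nerve, often with raised intraocular pressure.",
    "Diabetic retinopathy affects the blood vessels of the retina in people with diabetes.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, k=1):
    """Return the k knowledge entries most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda entry: cosine(query, Counter(tokenize(entry))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, k=1):
    """Prepend retrieved knowledge to the question before it reaches an LLM."""
    context = "\n".join(f"- {entry}" for entry in retrieve(question, k))
    return (
        "Use the following ophthalmic knowledge to answer.\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt("What is a cataract?"))
```

In a local deployment of the kind the abstract describes, the prompt returned by `build_prompt` would be passed to a locally hosted baseline LLM, so neither the question nor the knowledge base leaves the institution. A production system would more plausibly use dense embeddings than word counts; the word-count ranking here only keeps the sketch dependency-free.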

Exposures: LLM response to clinical questions.

Main outcomes and measures: Accuracy, utility, and safety of LLMs in responding to clinical questions.

Results: The baseline model achieved a human ranking score of 0.48. The retrieval-augmented LLM had a score of 0.60, a difference of 0.12 (95% CI, 0.02-0.22; P = .02) from baseline and not different from GPT-4 with a score of 0.61 (difference = 0.01; 95% CI, -0.11 to 0.13; P = .89). For scientific consensus, the retrieval-augmented LLM was 84.0% compared with the baseline model of 46.5% (difference = 37.5%; 95% CI, 29.0%-46.0%; P < .001) and not different from GPT-4 with a value of 79.2% (difference = 4.8%; 95% CI, -0.3% to 10.0%; P = .06).
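
As a quick consistency check on the figures above, the reported differences can be recomputed from the reported scores, and each 95% CI can be read against zero. This is a sketch only: the numbers are transcribed from the Results paragraph, the helper names (`difference`, `ci_excludes_zero`) are hypothetical, and the intervals themselves are not recomputed because that would require the underlying ratings, which the abstract does not provide.

```python
# Values transcribed from the Results paragraph of the abstract.
ranking = {"baseline": 0.48, "retrieval_augmented": 0.60, "gpt4": 0.61}
consensus = {"baseline": 46.5, "retrieval_augmented": 84.0, "gpt4": 79.2}

# Reported 95% CIs for each comparison: (lower bound, upper bound).
reported_ci = {
    "ranking: augmented vs baseline": (0.02, 0.22),
    "ranking: GPT-4 vs augmented": (-0.11, 0.13),
    "consensus: augmented vs baseline": (29.0, 46.0),
    "consensus: augmented vs GPT-4": (-0.3, 10.0),
}


def difference(a, b, ndigits):
    """Difference a - b, rounded to the precision used in the abstract."""
    return round(a - b, ndigits)


def ci_excludes_zero(lower, upper):
    """True when the interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


differences = {
    "ranking: augmented vs baseline": difference(
        ranking["retrieval_augmented"], ranking["baseline"], 2),
    "ranking: GPT-4 vs augmented": difference(
        ranking["gpt4"], ranking["retrieval_augmented"], 2),
    "consensus: augmented vs baseline": difference(
        consensus["retrieval_augmented"], consensus["baseline"], 1),
    "consensus: augmented vs GPT-4": difference(
        consensus["retrieval_augmented"], consensus["gpt4"], 1),
}

if __name__ == "__main__":
    for name, value in differences.items():
        lower, upper = reported_ci[name]
        print(f"{name}: difference = {value}, "
              f"CI excludes zero = {ci_excludes_zero(lower, upper)}")
```

The recomputed differences (0.12, 0.01, 37.5, and 4.8) match the reported ones. Only the two comparisons against the baseline have intervals that exclude zero, which agrees with the reported P values (.02 and < .001 versus .89 and .06). The consensus differences are percentage-point differences between two percentages.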

Conclusions and relevance: Results of this quality improvement study suggest that the integration of high-quality knowledge bases improved the LLM's performance in medical domains. This study highlights the transformative potential of augmented LLMs in clinical practice by providing reliable, safe, and practical clinical information. Further research is needed to explore the broader application of such frameworks in the real world.

Source journal: JAMA Ophthalmology (category: Ophthalmology)
CiteScore: 13.20
Self-citation rate: 3.70%
Articles published: 340

About the journal: JAMA Ophthalmology, with a rich history of continuous publication since 1869, stands as a distinguished international, peer-reviewed journal dedicated to ophthalmology and visual science. In 2019, the journal proudly commemorated 150 years of uninterrupted service to the field. As a member of the esteemed JAMA Network, a consortium renowned for its peer-reviewed general medical and specialty publications, JAMA Ophthalmology upholds the highest standards of excellence in disseminating cutting-edge research and insights. Join us in celebrating our legacy and advancing the frontiers of ophthalmology and visual science.