RadioRAG: Online Retrieval-Augmented Generation for Radiology Question Answering.

IF 13.2 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Radiology-Artificial Intelligence Pub Date : 2025-07-01 DOI:10.1148/ryai.240476

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Keno Bressem, Robert Siepmann, Lisa Adams, Dyke Ferber, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

Citations: 0

Abstract

"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered that could affect the content.

RadioRAG: Online Retrieval-Augmented Generation for Radiology Question Answering.

Purpose: To evaluate diagnostic accuracy of various large language models (LLMs) when answering radiology-specific questions with and without access to additional online, up-to-date information via retrieval-augmented generation (RAG).

Materials and Methods: The authors developed radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. RAG incorporates information retrieval from external sources to supplement the initial prompt, grounding the model's response in relevant information. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo [OpenAI], GPT-4, Mistral 7B, Mixtral 8×7B [Mistral], and Llama3-8B and -70B [Meta]) were prompted with and without RadioRAG in a zero-shot inference scenario (temperature ≤ 0.1, top-p = 1). RadioRAG retrieved context-specific information from www.radiopaedia.org. Accuracy of LLMs with and without RadioRAG in answering questions from each dataset was assessed. Statistical analyses were performed using bootstrapping while preserving pairing. Additional assessments included comparison of model with human performance and comparison of time required for conventional versus RadioRAG-powered question answering.

Results: RadioRAG improved accuracy for some LLMs, including GPT-3.5-turbo (74% [59 of 80] vs 66% [53 of 80], false discovery rate [FDR] = 0.03) and Mixtral 8×7B (76% [61 of 80] vs 65% [52 of 80], FDR = 0.02) on the RSNA radiology question answering (RSNA-RadioQA) dataset, with similar trends in the ExtendedQA dataset. Accuracy exceeded that of a human expert (63% [50 of 80], FDR ≤ 0.007) for these LLMs, although not for Mistral 7B-instruct-v0.2, Llama3-8B, and Llama3-70B (FDR ≥ 0.21). RadioRAG reduced hallucinations for all LLMs (rate, 6%-25%). RadioRAG increased estimated response time fourfold.

Conclusion: RadioRAG shows potential to improve LLM accuracy and factuality in radiology QA by integrating real-time, domain-specific data.

Keywords: Retrieval-augmented Generation, Informatics, Computer-aided Diagnosis, Large Language Models

Supplemental material is available for this article.

© RSNA, 2025.
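
The abstract reports accuracy comparisons made by "bootstrapping while preserving pairing" and expresses significance as a false discovery rate. The sketch below illustrates one common way to do both; it is not the authors' code. The function names, the two-sided p-value rule, and the choice of the Benjamini-Hochberg procedure are assumptions, since the abstract does not name the FDR method. The per-question outcomes are hypothetical: only the marginal counts (59 of 80 and 53 of 80) come from the abstract, and the overlap between the two conditions is invented for illustration.

```python
import random


def paired_bootstrap(with_rag, without_rag, n_boot=2000, seed=0):
    """Return (observed accuracy difference, two-sided bootstrap p-value).

    Inputs are per-question 0/1 correctness lists in the same question order.
    """
    assert len(with_rag) == len(without_rag)
    n = len(with_rag)
    rng = random.Random(seed)
    # Work on per-question differences so each pair stays together.
    pair_diffs = [a - b for a, b in zip(with_rag, without_rag)]
    observed = sum(pair_diffs) / n
    boot = []
    for _ in range(n_boot):
        # Resample questions (not individual answers) with replacement.
        sample = [pair_diffs[rng.randrange(n)] for _ in range(n)]
        boot.append(sum(sample) / n)
    # Two-sided p-value: how often the resampled difference reaches zero
    # or crosses to the other side (an assumed, simple convention).
    frac_le0 = sum(d <= 0 for d in boot) / n_boot
    frac_ge0 = sum(d >= 0 for d in boot) / n_boot
    return observed, min(1.0, 2 * min(frac_le0, frac_ge0))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# HYPOTHETICAL per-question outcomes: 50 correct in both conditions,
# 9 correct only with RAG, 3 correct only without, 18 wrong in both.
# Marginals match the abstract (59/80 vs 53/80); the overlap is assumed.
with_rag = [1] * 50 + [1] * 9 + [0] * 3 + [0] * 18
without_rag = [1] * 50 + [0] * 9 + [1] * 3 + [0] * 18

observed_diff, p_value = paired_bootstrap(with_rag, without_rag)
print(f"accuracy difference: {observed_diff:.3f}, bootstrap p: {p_value:.3f}")
```

Resampling the per-question differences, rather than each condition separately, is what keeps the pairing intact: a question that is hard for the model with and without retrieval contributes zero to every resample in which it appears. In a multi-model study the per-model p-values would then be passed together to `benjamini_hochberg`.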

Source journal

Radiology-Artificial Intelligence

CiteScore

16.20

Self-citation rate

1.00%

Articles published

About the journal: Radiology: Artificial Intelligence is a bi-monthly publication that focuses on the emerging applications of machine learning and artificial intelligence in the field of imaging across various disciplines. This journal is available online and accepts multiple manuscript types, including Original Research, Technical Developments, Data Resources, Review articles, Editorials, Letters to the Editor and Replies, Special Reports, and AI in Brief.