Evaluation of a retrieval-augmented generation system using a Japanese Institutional Nuclear Medicine Manual and large language model-automated scoring.
Abstract
Recent advances in large language models (LLMs) enable domain-specific question answering using external knowledge. However, addressing information that is not included in training data remains a challenge, particularly in nuclear medicine, where examination protocols are frequently updated and vary across institutions. In this study, we developed a retrieval-augmented generation (RAG) system using 40 internal manuals from a single Japanese hospital, each corresponding to a different examination in nuclear medicine. These institution-specific documents were segmented and indexed using a hybrid retrieval strategy combining dense vector search (text-embedding-3-small) and sparse keyword search (BM25). GPT-3.5 and GPT-4o were used with the OpenAI application programming interface (API) for response generation. The quality of the generated answers was assessed on a four-point Likert scale by three certified radiological technologists, one of whom held an additional certification in nuclear medicine and another an additional certification in medical physics. Automated evaluation was conducted using RAGAS metrics, including factual correctness and context recall. The GPT-4o model combined with hybrid retrieval achieved the highest performance according to the expert evaluations. Although traditional string-based metrics such as ROUGE and the Levenshtein distance aligned poorly with human ratings, RAGAS provided consistent rankings across system configurations, despite showing only a modest correlation with manual scores. These findings demonstrate that integrating examination-specific institutional manuals into RAG frameworks can effectively support domain-specific question answering in nuclear medicine. Moreover, LLM-based evaluation methods such as RAGAS may serve as practical tools to complement expert reviews in developing healthcare-oriented artificial intelligence systems.
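The abstract outlines a hybrid-retrieval RAG pipeline (dense text-embedding-3-small vectors fused with BM25 keyword scores, then GPT-4o generation via the OpenAI API). The sketch below illustrates that flow under stated assumptions: the rank_bm25 package, a cosine/BM25 score fusion with an assumed weight, and hypothetical manual excerpts stand in for details the paper does not specify; it is not the authors' implementation.

```python
# Minimal illustrative sketch of a hybrid-retrieval RAG pipeline, assuming the
# openai (v1), rank_bm25, and numpy packages. Chunk texts, the fusion weight
# alpha, and the prompt wording are hypothetical placeholders.
import numpy as np
from openai import OpenAI
from rank_bm25 import BM25Okapi

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical excerpts from segmented institutional examination manuals.
chunks = [
    "Bone scintigraphy: inject 740 MBq of Tc-99m MDP; image 2-4 h post-injection.",
    "Myocardial perfusion SPECT: rest/stress protocol with Tc-99m tetrofosmin.",
]

def embed(texts):
    """Dense embeddings with text-embedding-3-small, as named in the abstract."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vecs = embed(chunks)
bm25 = BM25Okapi([c.lower().split() for c in chunks])  # sparse keyword index

def hybrid_search(query, k=2, alpha=0.5):
    """Fuse cosine similarity and normalized BM25 scores (alpha is an assumption)."""
    q_vec = embed([query])[0]
    dense = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    sparse = bm25.get_scores(query.lower().split())
    sparse = sparse / (sparse.max() + 1e-9)
    fused = alpha * dense + (1 - alpha) * sparse
    return [chunks[i] for i in np.argsort(fused)[::-1][:k]]

def answer(query):
    """Generate a context-grounded answer with GPT-4o."""
    context = "\n".join(hybrid_search(query))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided manual excerpts."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What dose is administered for bone scintigraphy?"))
```

The abstract also reports automated scoring with RAGAS metrics such as factual correctness and context recall. A minimal evaluation sketch follows; metric names and import paths vary across ragas releases, so this assumes a 0.1-style API with the answer_correctness and context_recall metric objects as stand-ins, and the question/answer strings are placeholders.

```python
# Sketch of RAGAS-style automated scoring, assuming ragas (0.1-style API)
# and the datasets package. Fields mirror the expected evaluation schema.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, context_recall

data = Dataset.from_dict({
    "question": ["What dose is administered for bone scintigraphy?"],
    "answer": ["740 MBq of Tc-99m MDP, with imaging 2-4 h after injection."],
    "contexts": [["Bone scintigraphy: inject 740 MBq of Tc-99m MDP; image 2-4 h post-injection."]],
    "ground_truth": ["Inject 740 MBq of Tc-99m MDP and image 2-4 hours post-injection."],
})

scores = evaluate(data, metrics=[answer_correctness, context_recall])
print(scores)  # per-metric averages, used here to rank system configurations
```

Such LLM-judged scores are what the study compares against the expert Likert ratings and against string-based baselines like ROUGE and Levenshtein distance.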
Journal introduction:
The purpose of the journal Radiological Physics and Technology is to provide a forum for sharing new knowledge related to research and development in radiological science and technology, including medical physics and radiological technology in diagnostic radiology, nuclear medicine, radiation therapy, and many other radiological disciplines, and to contribute to progress and improvement in medical practice and patient health care.