Comparison of Large Language Models' Performance on 600 Nuclear Medicine Technology Board Examination-Style Questions.

Journal of Nuclear Medicine Technology · IF 1.3 · Q4 (Radiology, Nuclear Medicine & Medical Imaging)
Michael A Oumano, Shawn M Pickett
{"title":"Comparison of Large Language Models' Performance on 600 Nuclear Medicine Technology Board Examination-Style Questions.","authors":"Michael A Oumano, Shawn M Pickett","doi":"10.2967/jnmt.124.269335","DOIUrl":null,"url":null,"abstract":"<p><p>This study investigated the application of large language models (LLMs) with and without retrieval-augmented generation (RAG) in nuclear medicine, particularly their performance across various topics relevant to the field, to evaluate their potential use as reliable tools for professional education and clinical decision-making. <b>Methods:</b> We evaluated the performance of LLMs, including the OpenAI GPT-4o series, Google Gemini, Cohere, Anthropic, and Meta Llama3, across 15 nuclear medicine topics. The models' accuracy was assessed using a set of 600 sample questions, covering a range of clinical and technical domains in nuclear medicine. Overall accuracy was measured by averaging performance across these topics. Additional performance comparisons were conducted across individual models. <b>Results:</b> OpenAI's models, particularly openai_nvidia_gpt-4o_final and openai_mxbai_gpt-4o_final, demonstrated the highest overall accuracy, achieving scores of 0.787 and 0.783, respectively, when RAG was implemented. Anthropic Opus and Google Gemini 1.5 Pro followed closely, with competitive overall accuracy scores of 0.773 and 0.750 with RAG. Cohere and Llama3 models showed more variability in performance, with the Llama3 ollama_llama3 model (without RAG) achieving the lowest accuracy. Discrepancies were noted in question interpretation, particularly in complex clinical guidelines and imaging-based queries. <b>Conclusion:</b> LLMs show promising potential in nuclear medicine, improving diagnostic accuracy, especially in areas like radiation safety and skeletal system scintigraphy. This study also demonstrates that adding a RAG workflow can increase the accuracy of an off-the-shelf model. However, challenges persist in handling nuanced guidelines and visual data, emphasizing the need for further optimization in LLMs for medical applications.</p>","PeriodicalId":16548,"journal":{"name":"Journal of nuclear medicine technology","volume":" ","pages":"262-267"},"PeriodicalIF":1.3000,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of nuclear medicine technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2967/jnmt.124.269335","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}

Abstract

This study investigated the application of large language models (LLMs), with and without retrieval-augmented generation (RAG), in nuclear medicine, focusing on their performance across topics relevant to the field, to evaluate their potential as reliable tools for professional education and clinical decision-making. Methods: We evaluated the performance of LLMs, including the OpenAI GPT-4o series, Google Gemini, Cohere, Anthropic, and Meta Llama3, across 15 nuclear medicine topics. Model accuracy was assessed using a set of 600 sample questions covering a range of clinical and technical domains in nuclear medicine. Overall accuracy was measured by averaging performance across these topics, and additional comparisons were conducted across individual models. Results: OpenAI's models, particularly openai_nvidia_gpt-4o_final and openai_mxbai_gpt-4o_final, demonstrated the highest overall accuracy, scoring 0.787 and 0.783, respectively, when RAG was implemented. Anthropic Opus and Google Gemini 1.5 Pro followed closely, with competitive overall accuracy scores of 0.773 and 0.750 with RAG. The Cohere and Llama3 models showed more variable performance, with the Llama3 ollama_llama3 model (without RAG) achieving the lowest accuracy. Discrepancies were noted in question interpretation, particularly for complex clinical guidelines and imaging-based queries. Conclusion: LLMs show promising potential in nuclear medicine, improving diagnostic accuracy, especially in areas such as radiation safety and skeletal system scintigraphy. This study also demonstrates that adding a RAG workflow can increase the accuracy of an off-the-shelf model. However, challenges persist in handling nuanced guidelines and visual data, underscoring the need for further optimization of LLMs for medical applications.
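The abstract outlines an evaluation workflow: retrieve supporting passages (the RAG step), query the model on each of the 600 questions, and average accuracy across the 15 topics. The minimal sketch below makes that logic concrete. Everything in it is illustrative rather than the authors' code: the toy bag-of-words retriever stands in for the study's embedding-based retrieval (the "nvidia" and "mxbai" tokens in the model names suggest dedicated embedding models), and `ask_model`, `evaluate`, and the data shapes are hypothetical.

```python
# Sketch of the evaluation logic described above: retrieve context per
# question (RAG), collect per-topic accuracy, then macro-average across
# topics for the overall score. Retrieval here is a toy bag-of-words
# cosine similarity, used only so the example is self-contained.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' for illustration only."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the question."""
    q = embed(question)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def ask_model(question: str, context: list[str]) -> str:
    """Hypothetical LLM call; a real harness would hit an API here
    with the retrieved context prepended to the question."""
    ...

def evaluate(questions: list[dict], corpus: list[str]):
    """questions: dicts with 'topic', 'text', and 'answer' keys."""
    per_topic: dict[str, list[int]] = {}
    for q in questions:
        context = retrieve(q["text"], corpus)       # RAG step
        predicted = ask_model(q["text"], context)   # model's answer choice
        per_topic.setdefault(q["topic"], []).append(int(predicted == q["answer"]))
    topic_acc = {t: sum(s) / len(s) for t, s in per_topic.items()}
    overall = sum(topic_acc.values()) / len(topic_acc)  # macro-average
    return topic_acc, overall
```

One note on the averaging: if the 600 questions are split evenly across the 15 topics (40 each), this macro-average equals the plain per-question accuracy; otherwise macro-averaging weights every topic equally regardless of how many questions it contributes.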
