Vision-language model performance on the Japanese Nuclear Medicine Board Examination: high accuracy in text but challenges with image interpretation.

IF 2.5 · CAS Tier 4 (Medicine) · JCR Q2, RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Rintaro Ito, Keita Kato, Marina Higashi, Yumi Abe, Ryogo Minamimoto, Katsuhiko Kato, Toshiaki Taoka, Shinji Naganawa
DOI: 10.1007/s12149-025-02084-x
Journal: Annals of Nuclear Medicine
Published: 2025-07-15
Cited by: 0

Abstract

Objective: Vision-language models (VLMs) extend large language models with visual input. VLMs are developing rapidly and their accuracy is improving quickly, but their performance in nuclear medicine, compared across state-of-the-art models including reasoning models, is not yet clear. We evaluated state-of-the-art VLMs on questions from past Japanese Nuclear Medicine Board Examinations (JNMBE) and assessed their strengths and limitations.

Methods: We collected 180 multiple-choice questions from the JNMBE (2022-2024); about one-third included diagnostic images. We tested eight of the latest VLMs: ChatGPT o1 pro, ChatGPT o1, ChatGPT o3-mini, ChatGPT-4.5, Claude 3.7, Gemini 2.0 Flash Thinking, Llama 3.2, and Gemma 3. Each model answered every question three times in a deterministic setting, and the final answer was determined by majority vote. Two board-certified nuclear medicine physicians independently provided reference answers, with a third expert resolving disagreements. We calculated overall accuracy with 95% confidence intervals and performed subgroup analyses by question type, content, and exam year.
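The three-run, majority-vote aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' code; the paper does not state a tie-breaking rule, so falling back to the first-seen choice among tied counts is an assumption here.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent of a model's repeated answers to one question.

    With three runs per question a three-way tie is possible;
    Counter.most_common then breaks it by insertion order (an assumption,
    since the paper does not state its tie rule).
    """
    choice, _ = Counter(answers).most_common(1)[0]
    return choice

# Three deterministic runs on one hypothetical question:
print(majority_vote(["b", "b", "d"]))  # prints b
```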

Results: Overall accuracies ranged from 36.1% (Gemma 3) to 83.3% (ChatGPT o1 pro). ChatGPT o1 pro achieved the highest score (150/180, 83.3% [95% CI: 77.1-88.5%]), followed by ChatGPT o3-mini (82.8%) and ChatGPT o1 (78.9%). All models performed better on text-only questions than on image-based ones; ChatGPT o1 pro correctly answered 89.5% of text questions versus 66.0% of image questions. VLMs showed limitations in handling questions on Japanese regulations. ChatGPT-4.5 excelled on neurology-related image-based questions (76.9%). Accuracy declined slightly from the 2022 to the 2024 exam for most models.
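The reported interval for the top score (150/180 correct) can be approximately reproduced with a standard binomial confidence interval. The sketch below uses the Wilson score interval, implemented with the standard library only; this is an assumption, since the paper does not state its CI method (an exact Clopper-Pearson interval is slightly wider and closer to the reported 77.1-88.5%).

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score 95% CI for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

lo, hi = wilson_ci(150, 180)
print(f"{150/180:.1%} [{lo:.1%}, {hi:.1%}]")  # 83.3% [77.2%, 88.1%]
```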

Conclusions: VLMs demonstrated high accuracy on the JNMBE, especially on text-based questions, but exhibited limitations on image-based questions. These findings suggest that VLMs can be useful assistants for text-based questions in medical domains but remain limited on comprehensive questions that include images. Currently, VLMs cannot replace comprehensive training and expert interpretation. Because VLMs evolve rapidly and exam difficulty varies annually, these findings should be interpreted in that context.

Source journal: Annals of Nuclear Medicine (Medicine – Nuclear Medicine)
CiteScore: 4.90
Self-citation rate: 7.70%
Articles per year: 111
Review time: 4-8 weeks
About the journal: Annals of Nuclear Medicine is an official journal of the Japanese Society of Nuclear Medicine. It promotes the appropriate application of radioactive substances and stable nuclides in medicine, fosters the exchange of ideas, information, and research in nuclear medicine, and covers the medical application of radionuclides and related subjects. It publishes original articles, short communications, reviews, and letters to the editor.