Vision-language model performance on the Japanese Nuclear Medicine Board Examination: high accuracy in text but challenges with image interpretation.

IF 2.5 · CAS Tier 4 (Medicine) · JCR Q2, RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Rintaro Ito, Keita Kato, Marina Higashi, Yumi Abe, Ryogo Minamimoto, Katsuhiko Kato, Toshiaki Taoka, Shinji Naganawa
DOI: 10.1007/s12149-025-02084-x
Journal: Annals of Nuclear Medicine
Published: 2025-07-15
Cited by: 0

Abstract

Objective: Vision-language models (VLMs) extend large language models with visual input. VLMs are developing rapidly and their accuracy is improving quickly, but their performance in nuclear medicine, compared across state-of-the-art models including reasoning models, is not yet clear. We evaluated state-of-the-art VLMs on questions from past Japanese Nuclear Medicine Board Examinations (JNMBE) and assessed their strengths and limitations.

Methods: We collected 180 multiple-choice questions from the JNMBE (2022-2024); about one-third included diagnostic images. We tested eight of the latest VLMs: ChatGPT o1 pro, ChatGPT o1, ChatGPT o3-mini, ChatGPT-4.5, Claude 3.7, Gemini 2.0 Flash Thinking, Llama 3.2, and Gemma 3. Each model answered every question three times in a deterministic setting, and the final answer was determined by majority vote. Two board-certified nuclear medicine physicians independently provided reference answers, with a third expert resolving disagreements. We calculated overall accuracy with 95% confidence intervals and performed subgroup analyses by question type, content, and exam year.
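The three-run, majority-vote aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' code; the paper does not state a tie-breaking rule, so falling back to the first-seen choice among tied counts is an assumption here.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent of a model's repeated answers to one question.

    With three runs per question a three-way tie is possible;
    Counter.most_common then breaks it by insertion order (an assumption,
    since the paper does not state its tie rule).
    """
    choice, _ = Counter(answers).most_common(1)[0]
    return choice

# Three deterministic runs on one hypothetical question:
print(majority_vote(["b", "b", "d"]))  # prints b
```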

Results: Overall accuracies ranged from 36.1% (Gemma 3) to 83.3% (ChatGPT o1 pro). ChatGPT o1 pro achieved the highest score (150/180, 83.3% [95% CI: 77.1-88.5%]), followed by ChatGPT o3-mini (82.8%) and ChatGPT o1 (78.9%). All models performed better on text-only questions than on image-based ones; ChatGPT o1 pro correctly answered 89.5% of text questions versus 66.0% of image questions. VLMs showed limitations in handling questions on Japanese regulations. ChatGPT-4.5 excelled on neurology-related image-based questions (76.9%). Accuracy declined slightly from the 2022 to the 2024 exam for most models.
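The reported interval for the top score (150/180 correct) can be approximately reproduced with a standard binomial confidence interval. The sketch below uses the Wilson score interval, implemented with the standard library only; this is an assumption, since the paper does not state its CI method (an exact Clopper-Pearson interval is slightly wider and closer to the reported 77.1-88.5%).

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score 95% CI for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

lo, hi = wilson_ci(150, 180)
print(f"{150/180:.1%} [{lo:.1%}, {hi:.1%}]")  # 83.3% [77.2%, 88.1%]
```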

Conclusions: VLMs demonstrated high accuracy on the JNMBE, especially on text-based questions, but exhibited limitations on image-based questions. These findings suggest that VLMs can be useful assistants for text-based questions in medical domains but remain limited on comprehensive questions that include images. Currently, VLMs cannot replace comprehensive training and expert interpretation. Because VLMs evolve rapidly and exam difficulty varies annually, these findings should be interpreted in that context.

Source journal: Annals of Nuclear Medicine (Medicine – Nuclear Medicine)
CiteScore: 4.90
Self-citation rate: 7.70%
Articles per year: 111
Review time: 4-8 weeks
About the journal: Annals of Nuclear Medicine is an official journal of the Japanese Society of Nuclear Medicine. It promotes the appropriate application of radioactive substances and stable nuclides in medicine, fosters the exchange of ideas, information, and research in nuclear medicine, and covers the medical application of radionuclides and related subjects. It publishes original articles, short communications, reviews, and letters to the editor.