{"title":"视觉语言模型在日本核医学委员会考试中的表现:文本准确性高,但图像解释存在挑战。","authors":"Rintaro Ito, Keita Kato, Marina Higashi, Yumi Abe, Ryogo Minamimoto, Katsuhiko Kato, Toshiaki Taoka, Shinji Naganawa","doi":"10.1007/s12149-025-02084-x","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Vision language models (VLMs) allow visual input to Large Language Models. VLMs have been developing rapidly, and their accuracy is improving rapidly. Their performance in nuclear medicine compared to state-of-the-art models, including reasoning models, is not yet clear. We evaluated state-of-the-art VLMs using problems from the past Japan Nuclear Medicine Board Examination (JNMBE) and assessed their strengths and limitations.</p><p><strong>Methods: </strong>We collected 180 multiple-choice questions from JNMBE (2022-2024). About one-third included diagnostic images. We used eight latest VLMs. ChatGPT o1 pro, ChatGPT o1, ChatGPT o3-mini, ChatGPT-4.5, Claude 3.7, Gemini 2.0 Flash thinking, Llama 3.2, and Gemma 3 were tested. Each model answered every question three times in a deterministic setting, and the final answer was set by majority vote. Two board-certified nuclear medicine physicians independently provided reference answers, with a third expert resolving disagreements. We calculated overall accuracy with 95% confidence intervals and performed subgroup analyses by question type, content, and exam year.</p><p><strong>Results: </strong>Overall accuracies ranged from 36.1% (Gemma 3) to 83.3% (ChatGPT o1 pro). ChatGPT o1 pro achieved the highest score (150/180, 83.3% [95% CI: 77.1-88.5%]), followed by ChatGPT o3-mini (82.8%) and ChatGPTo1 (78.9%). All models performed better on text-only questions than on image-based ones; ChatGPT o1 pro correctly answered 89.5% of text questions versus 66.0% of image questions. VLMs demonstrated limitations in handling with questions on Japanese regulations. ChatGPT 4.5 excelled in neurology-related image-based questions (76.9%). Accuracy was slightly lower from 2022 to 2024 for most models.</p><p><strong>Conclusions: </strong>VLMs demonstrated high accuracy on the JNMBE, especially on text-based questions, but exhibited limitations with image recognition questions. These findings show that VLMs can be a good assistant for text-based questions in medical domains but have limitations when it comes to comprehensive questions that include images. Currently, VLMs cannot replace comprehensive training and expert interpretation. Because VLMs evolve rapidly and exam difficulty varies annually, these findings should be interpreted in that context.</p>","PeriodicalId":8007,"journal":{"name":"Annals of Nuclear Medicine","volume":" ","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Vision-language model performance on the Japanese Nuclear Medicine Board Examination: high accuracy in text but challenges with image interpretation.\",\"authors\":\"Rintaro Ito, Keita Kato, Marina Higashi, Yumi Abe, Ryogo Minamimoto, Katsuhiko Kato, Toshiaki Taoka, Shinji Naganawa\",\"doi\":\"10.1007/s12149-025-02084-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>Vision language models (VLMs) allow visual input to Large Language Models. VLMs have been developing rapidly, and their accuracy is improving rapidly. 
Their performance in nuclear medicine compared to state-of-the-art models, including reasoning models, is not yet clear. We evaluated state-of-the-art VLMs using problems from the past Japan Nuclear Medicine Board Examination (JNMBE) and assessed their strengths and limitations.</p><p><strong>Methods: </strong>We collected 180 multiple-choice questions from JNMBE (2022-2024). About one-third included diagnostic images. We used eight latest VLMs. ChatGPT o1 pro, ChatGPT o1, ChatGPT o3-mini, ChatGPT-4.5, Claude 3.7, Gemini 2.0 Flash thinking, Llama 3.2, and Gemma 3 were tested. Each model answered every question three times in a deterministic setting, and the final answer was set by majority vote. Two board-certified nuclear medicine physicians independently provided reference answers, with a third expert resolving disagreements. We calculated overall accuracy with 95% confidence intervals and performed subgroup analyses by question type, content, and exam year.</p><p><strong>Results: </strong>Overall accuracies ranged from 36.1% (Gemma 3) to 83.3% (ChatGPT o1 pro). ChatGPT o1 pro achieved the highest score (150/180, 83.3% [95% CI: 77.1-88.5%]), followed by ChatGPT o3-mini (82.8%) and ChatGPTo1 (78.9%). All models performed better on text-only questions than on image-based ones; ChatGPT o1 pro correctly answered 89.5% of text questions versus 66.0% of image questions. VLMs demonstrated limitations in handling with questions on Japanese regulations. ChatGPT 4.5 excelled in neurology-related image-based questions (76.9%). Accuracy was slightly lower from 2022 to 2024 for most models.</p><p><strong>Conclusions: </strong>VLMs demonstrated high accuracy on the JNMBE, especially on text-based questions, but exhibited limitations with image recognition questions. These findings show that VLMs can be a good assistant for text-based questions in medical domains but have limitations when it comes to comprehensive questions that include images. Currently, VLMs cannot replace comprehensive training and expert interpretation. Because VLMs evolve rapidly and exam difficulty varies annually, these findings should be interpreted in that context.</p>\",\"PeriodicalId\":8007,\"journal\":{\"name\":\"Annals of Nuclear Medicine\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.5000,\"publicationDate\":\"2025-07-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Annals of Nuclear Medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1007/s12149-025-02084-x\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annals of Nuclear Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s12149-025-02084-x","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Vision-language model performance on the Japanese Nuclear Medicine Board Examination: high accuracy in text but challenges with image interpretation.
Objective: Vision-language models (VLMs) extend large language models to accept visual input. VLMs are developing rapidly and their accuracy is improving quickly, yet how state-of-the-art models, including reasoning models, perform in nuclear medicine is not yet clear. We evaluated state-of-the-art VLMs on questions from past Japan Nuclear Medicine Board Examinations (JNMBE) and assessed their strengths and limitations.
Methods: We collected 180 multiple-choice questions from the JNMBE (2022-2024); about one-third included diagnostic images. We tested eight recent VLMs: ChatGPT o1 pro, ChatGPT o1, ChatGPT o3-mini, ChatGPT-4.5, Claude 3.7, Gemini 2.0 Flash Thinking, Llama 3.2, and Gemma 3. Each model answered every question three times under deterministic settings, and the final answer was determined by majority vote. Two board-certified nuclear medicine physicians independently provided reference answers, with a third expert resolving disagreements. We calculated overall accuracy with 95% confidence intervals and performed subgroup analyses by question type, content, and examination year.
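The abstract does not state which confidence-interval method was used; an exact Clopper-Pearson interval happens to reproduce the reported 77.1-88.5% for 150/180 correct answers, so the sketch below assumes that method. The helper names `majority_vote` and `clopper_pearson` are illustrative, not from the paper.

```python
from collections import Counter
from scipy.stats import beta

def majority_vote(answers):
    # Most frequent answer across the three repeated runs; ties resolve to
    # the answer encountered first (the abstract does not describe tie handling).
    return Counter(answers).most_common(1)[0][0]

def clopper_pearson(k, n, alpha=0.05):
    # Exact (Clopper-Pearson) binomial confidence interval for k successes in n trials.
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Reproduces the interval reported for the top model (150/180 correct):
print(majority_vote(["b", "b", "c"]))                 # -> "b"
lo, hi = clopper_pearson(150, 180)
print(f"{150/180:.1%} [95% CI: {lo:.1%}-{hi:.1%}]")   # ~83.3% [77.1%-88.5%]
```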
Results: Overall accuracy ranged from 36.1% (Gemma 3) to 83.3% (ChatGPT o1 pro). ChatGPT o1 pro achieved the highest score (150/180, 83.3% [95% CI: 77.1-88.5%]), followed by ChatGPT o3-mini (82.8%) and ChatGPT o1 (78.9%). All models performed better on text-only questions than on image-based ones; ChatGPT o1 pro correctly answered 89.5% of text-only questions versus 66.0% of image-based questions. The VLMs showed limitations in handling questions on Japanese regulations. ChatGPT-4.5 excelled in neurology-related image-based questions (76.9%). For most models, accuracy declined slightly from the 2022 to the 2024 examinations.
Conclusions: VLMs demonstrated high accuracy on the JNMBE, especially on text-based questions, but showed limitations on image-interpretation questions. These findings suggest that VLMs can be useful assistants for text-based questions in medical domains but remain limited on comprehensive questions that include images. At present, VLMs cannot replace comprehensive training and expert interpretation. Because VLMs evolve rapidly and examination difficulty varies from year to year, these findings should be interpreted in that context.
Journal Introduction:
Annals of Nuclear Medicine is an official journal of the Japanese Society of Nuclear Medicine. It promotes the appropriate application of radioactive substances and stable nuclides in the field of medicine.
The journal fosters the exchange of ideas, information, and research in nuclear medicine, including the medical application of radionuclides and related subjects. It publishes original articles, short communications, reviews, and letters to the editor.