{"title":"Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination.","authors":"Yuichiro Hirano, Soichiro Miki, Yosuke Yamagishi, Shouhei Hanaoka, Takahiro Nakao, Tomohiro Kikuchi, Yuta Nakamura, Yukihiro Nomura, Takeharu Yoshikawa, Osamu Abe","doi":"10.1007/s11604-025-01861-y","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE).</p><p><strong>Materials and methods: </strong>The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with inputting images (vision) and without (text-only). Performance differences between the conditions were assessed using McNemar's exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) using a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedman's test, followed by pairwise Wilcoxon signed-rank tests with Holm correction.</p><p><strong>Results: </strong>The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. 
Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters.</p><p><strong>Conclusion: </strong>Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology. Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAI's o3 and Google DeepMind's Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.</p>","PeriodicalId":14691,"journal":{"name":"Japanese Journal of Radiology","volume":" ","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Japanese Journal of Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s11604-025-01861-y","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Purpose: To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE).
Materials and methods: The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with image input (vision) and without it (text-only). Performance differences between the conditions were assessed using McNemar's exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) on a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedman's test, followed by pairwise Wilcoxon signed-rank tests with Holm correction.
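The two key procedures named above — McNemar's exact test on paired vision vs. text-only accuracy, and Holm correction of pairwise p-values — can be sketched in pure Python. This is an illustrative sketch, not the authors' analysis code, and the discordant-pair counts in the usage line are hypothetical:

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pair counts.

    b: questions answered correctly with vision but not text-only
    c: questions answered correctly text-only but not with vision
    Under the null, each discordant pair is a fair coin flip, so the
    smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided, capped at 1


def holm_correction(p_values: list[float]) -> list[float]:
    """Holm step-down adjustment for a family of pairwise comparisons."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k), enforce monotonicity.
        adj = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, adj)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical counts for one model: 25 questions flipped to correct with
# image input, 10 flipped to incorrect.
p = mcnemar_exact(b=25, c=10)
```

A significant `p` here would indicate, as in the study, that adding image input changed a model's accuracy beyond what chance discordance explains; the Holm step keeps the family-wise error rate controlled across the six pairwise Wilcoxon comparisons of four models.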
Results: The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 again ranked first, with an accuracy of 67%. The addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters.
Conclusion: Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology. Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAI's o3 and Google DeepMind's Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.
About the journal:
Japanese Journal of Radiology is a peer-reviewed journal, officially published by the Japan Radiological Society. The main purpose of the journal is to provide a forum for the publication of papers documenting recent advances and new developments in the field of radiology in medicine and biology. The scope of Japanese Journal of Radiology encompasses but is not restricted to diagnostic radiology, interventional radiology, radiation oncology, nuclear medicine, radiation physics, and radiation biology. Additionally, the journal covers technical and industrial innovations. The journal welcomes original articles, technical notes, review articles, pictorial essays and letters to the editor. The journal also provides announcements from the boards and the committees of the society. Membership in the Japan Radiological Society is not a prerequisite for submission. Contributions are welcomed from all parts of the world.