{"title":"Performance of multimodal large language models in the Japanese surgical specialist examination.","authors":"Yuji Miyamoto, Takeshi Nakaura, Hiro Nakamura, Toshinori Hirai, Masaaki Iwatsuki","doi":"10.1186/s12909-025-07938-6","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Multimodal large language models (LLMs) have the capability to process and integrate both text and image data, offering promising applications in the medical field. This study aimed to evaluate the performance of representative multimodal LLMs in the 2023 Japanese Surgical Specialist Examination, with a focus on image-based questions across various surgical subspecialties. METHODS: A total of 98 examination questions, including 43 image-based questions, from the 2023 Japanese Surgical Specialist Examination were administered to three multimodal LLMs: GPT-4 Omni, Claude 3.5 Sonnet, and Gemini Pro 1.5. Each model's performance was assessed under two conditions: with and without images. Statistical analysis was conducted using McNemar's test to evaluate the significance of accuracy differences between the two conditions. RESULTS: Among the three LLMs, Claude 3.5 Sonnet achieved the highest overall accuracy at 84.69%, exceeding the passing threshold of 80%, which is consistent with the standard set by the Japan Surgical Society for board certification. GPT-4 Omni closely approached the threshold with an accuracy of 79.59%, while Gemini Pro 1.5 scored 61.22%. Claude 3.5 Sonnet demonstrated the highest accuracy in four of six subspecialties for image-based questions and was the only model to show a statistically significant improvement with image inclusion (76.74% with images vs. 62.79% without images, p = 0.041). By contrast, GPT-4 Omni and Gemini Pro 1.5 did not exhibit significant performance changes with image inclusion.</p><p><strong>Conclusion: </strong>Claude 3.5 Sonnet outperformed the other models in most surgical subspecialties for image-based questions and was the only model to benefit significantly from image inclusion. These findings suggest that multimodal LLMs, particularly Claude 3.5 Sonnet, hold promise as diagnostic and educational support tools in surgical domains, and that variation in visual reasoning capabilities may account for model-level differences in image-based performance.</p>","PeriodicalId":51234,"journal":{"name":"BMC Medical Education","volume":"25 1","pages":"1379"},"PeriodicalIF":3.2000,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12513120/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Education","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12909-025-07938-6","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Abstract
Background: Multimodal large language models (LLMs) can process and integrate both text and image data, offering promising applications in medicine. This study evaluated the performance of representative multimodal LLMs on the 2023 Japanese Surgical Specialist Examination, with a focus on image-based questions across surgical subspecialties.
Methods: A total of 98 examination questions from the 2023 Japanese Surgical Specialist Examination, including 43 image-based questions, were administered to three multimodal LLMs: GPT-4 Omni, Claude 3.5 Sonnet, and Gemini Pro 1.5. Each model's performance was assessed under two conditions, with and without images, and McNemar's test was used to evaluate the significance of accuracy differences between the two conditions.
Results: Among the three LLMs, Claude 3.5 Sonnet achieved the highest overall accuracy at 84.69%, exceeding the 80% passing threshold set by the Japan Surgical Society for board certification. GPT-4 Omni approached the threshold with an accuracy of 79.59%, while Gemini Pro 1.5 scored 61.22%. On image-based questions, Claude 3.5 Sonnet demonstrated the highest accuracy in four of six subspecialties and was the only model to show a statistically significant improvement with image inclusion (76.74% with images vs. 62.79% without, p = 0.041). By contrast, GPT-4 Omni and Gemini Pro 1.5 showed no significant performance change with image inclusion.
Conclusion: Claude 3.5 Sonnet outperformed the other models in most surgical subspecialties for image-based questions and was the only model to benefit significantly from image inclusion. These findings suggest that multimodal LLMs, particularly Claude 3.5 Sonnet, hold promise as diagnostic and educational support tools in surgical domains, and that variation in visual reasoning capabilities may account for model-level differences in image-based performance.
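As a reading aid for the analysis described in the Methods, the sketch below shows how McNemar's test can compare a model's paired per-question correctness with vs. without images. The counts in the 2x2 table are hypothetical placeholders for illustration only; they are not the study's data, and the paper does not specify its software implementation.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 2x2 table of paired outcomes over 43 image-based questions.
# Rows: with images (correct, incorrect); columns: without images (correct, incorrect).
# Placeholder counts for illustration, not the study's actual results.
table = [
    [25, 8],  # both correct / correct only with images
    [2, 8],   # correct only without images / both incorrect
]

# McNemar's test considers only the discordant pairs (here 8 vs. 2);
# exact=True uses the binomial distribution, appropriate for small samples
# such as a 43-question set.
result = mcnemar(table, exact=True)
print(f"McNemar exact test: p = {result.pvalue:.3f}")
```

Because the same questions are answered under both conditions, this paired design is more sensitive than comparing the two accuracy percentages with an unpaired test.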
About the journal:
BMC Medical Education is an open access journal publishing original peer-reviewed research articles on the training of healthcare professionals, including undergraduate, postgraduate, and continuing education. The journal has a special focus on curriculum development, evaluations of performance, assessment of training needs, and evidence-based medicine.