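Comparison between multimodal foundation models and radiologists for the diagnosis of challenging neuroradiology cases with text and images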

Impact Factor 8.1; Zone 2, Medicine; JCR Q1, Radiology, Nuclear Medicine & Medical Imaging

Diagnostic and Interventional Imaging. Pub Date: 2025-10-01. DOI: 10.1016/j.diii.2025.04.006

Bastien Le Guellec, Cyril Bruge, Najib Chalhoub, Victor Chaton, Edouard De Sousa, Yann Gaillandre, Riyad Hanafi, Matthieu Masy, Quentin Vannod-Michel, Aghiles Hamroun, Grégory Kuchcinski, ARIANES investigators

Volume 106, Issue 10, Pages 345-352. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S2211568425000968

Citations: 0

Abstract


Comparison between multimodal foundation models and radiologists for the diagnosis of challenging neuroradiology cases with text and images

Purpose

The purpose of this study was to compare the ability of two multimodal models (GPT-4o and Gemini 1.5 Pro) with that of radiologists to generate differential diagnoses from textual context alone, key images alone, or a combination of both using complex neuroradiology cases.

Materials and methods

This retrospective study included neuroradiology cases from the "Diagnosis Please" series published in the journal Radiology between January 2008 and September 2024. The two multimodal models were asked to provide three differential diagnoses from textual context alone, key images alone, or the complete case. Six board-certified neuroradiologists, randomly assigned to two groups (context alone first or images alone first), solved the cases in the same setting. Three radiologists solved the cases without, and then with, the assistance of Gemini 1.5 Pro. An independent radiologist evaluated the quality of the image descriptions provided by GPT-4o and Gemini 1.5 Pro for each case. Differences in correct answers between multimodal models and radiologists were analyzed using the McNemar test.
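The McNemar test mentioned above compares two raters on the same cases and uses only the discordant pairs, that is, the cases where exactly one of the two is correct. The following is a minimal sketch of the exact (binomial) two-sided variant. The abstract does not say which variant the authors used, and the function name `mcnemar_exact` and the toy data are illustrative assumptions, not the study data.

```python
from math import comb


def mcnemar_exact(model_correct, reader_correct):
    """Two-sided exact McNemar test on paired binary outcomes.

    Returns (b, c, p_value), where b counts cases only the model got right
    and c counts cases only the reader got right.
    """
    if len(model_correct) != len(reader_correct):
        raise ValueError("paired outcomes must have the same length")
    pairs = list(zip(model_correct, reader_correct))
    b = sum(1 for m, r in pairs if m and not r)
    c = sum(1 for m, r in pairs if r and not m)
    n = b + c
    if n == 0:
        # No discordant pairs: the data carry no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data for 16 cases (NOT taken from the study):
# 2 cases both correct, 1 model only, 9 reader only, 4 both wrong.
model = [True, True, True] + [False] * 9 + [False] * 4
reader = [True, True, False] + [True] * 9 + [False] * 4

b, c, p_value = mcnemar_exact(model, reader)
print(f"model only: {b}, reader only: {c}, p = {p_value:.4f}")
```

Cases where both raters agree do not enter the statistic, which is why the test suits a paired design in which models and radiologists answer the same set of cases.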

Results

GPT-4o and Gemini 1.5 Pro outperformed radiologists using clinical context alone (mean accuracy, 34.0 % [18/53] and 44.7 % [23.7/53] vs. 16.4 % [8.7/53]; both P < 0.01). Radiologists outperformed GPT-4o and Gemini 1.5 Pro using images alone (mean accuracy, 42.0 % [22.3/53] vs. 3.8 % [2/53] and 7.5 % [4/53]; both P < 0.01) and the complete cases (48.0 % [25.6/53] vs. 34.0 % [18/53] and 38.7 % [20.3/53]; both P < 0.001). While radiologists improved their accuracy when combining multimodal information (from 42.1 % [22.3/53] for images alone to 50.3 % [26.7/53] for complete cases; P < 0.01), GPT-4o and Gemini 1.5 Pro did not benefit from the multimodal context (GPT-4o: from 34.0 % [18/53] for text alone to 35.2 % [18.7/53] for complete cases, P = 0.48; Gemini 1.5 Pro: from 44.7 % [23.7/53] to 42.8 % [22.7/53], P = 0.54). Radiologists benefited significantly from the suggestions of Gemini 1.5 Pro, increasing their accuracy from 47.2 % [25/53] to 56.0 % [27/53] (P < 0.01). GPT-4o and Gemini 1.5 Pro correctly identified the imaging modality in 53/53 (100 %) and 51/53 (96.2 %) cases, respectively, but frequently failed to identify key imaging findings (incorrect identification in 43/53 cases [81.1 %] for GPT-4o and 50/53 [94.3 %] for Gemini 1.5 Pro).
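Each accuracy above is reported as a percentage with a bracketed fraction out of 53 cases. The sketch below recomputes the percentage from a few of those fractions. It assumes that fractional counts such as 23.7 are means over several readers or repeated runs, which the abstract implies with "mean accuracy" but does not spell out. The helper `pct` and the 0.1-point tolerance are illustrative choices, and the check cannot say which of two disagreeing numbers is the intended one.

```python
TOTAL_CASES = 53


def pct(count, total=TOTAL_CASES):
    """Percentage of correct answers, rounded to one decimal place."""
    return round(100 * count / total, 1)


# (label, mean count of correct answers, percentage stated in the abstract)
reported = [
    ("GPT-4o, text alone", 18, 34.0),
    ("Gemini 1.5 Pro, text alone", 23.7, 44.7),
    ("Radiologists, text alone", 8.7, 16.4),
    ("GPT-4o, images alone", 2, 3.8),
    ("Gemini 1.5 Pro, images alone", 4, 7.5),
    ("Radiologists, unassisted", 25, 47.2),
    ("Radiologists, Gemini-assisted", 27, 56.0),
]

TOLERANCE = 0.1 + 1e-9  # allow one rounding step in the last digit

rows = [(label, count, stated, pct(count)) for label, count, stated in reported]
for label, count, stated, recomputed in rows:
    note = "" if abs(recomputed - stated) <= TOLERANCE else "  <- differs"
    print(f"{label}: {count}/{TOTAL_CASES} = {recomputed} % (stated {stated} %){note}")
```

With these inputs, every recomputed value matches the stated percentage within the tolerance except the Gemini-assisted figure, where 27/53 gives 50.9 % against a stated 56.0 %.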

Conclusion

Radiologists show a specific ability to benefit from the integration of textual and visual information, whereas multimodal models mostly rely on the clinical context to suggest diagnoses.


Source journal

Diagnostic and Interventional Imaging (Medicine: Radiology, Nuclear Medicine and Imaging)

CiteScore: 8.50
Self-citation rate: 29.10%
Articles published: 126
Review time: 11 days

About the journal: Diagnostic and Interventional Imaging accepts publications from any part of the world based solely on their scientific merit. The journal focuses on illustrated articles with rich iconography and aims to help sharpen clinical decision-making skills and to follow leading research topics. All articles are published in English. Diagnostic and Interventional Imaging publishes editorials, technical notes, letters, and original and review articles on abdominal, breast, cancer, cardiac, emergency, forensic medicine, head and neck, musculoskeletal, gastrointestinal, genitourinary, interventional, obstetric, pediatric, thoracic and vascular imaging, neuroradiology, and nuclear medicine, as well as contrast material, computer developments, health policies and practice, and medical physics relevant to imaging.