Can ChatGPT4-vision identify radiologic progression of multiple sclerosis on brain MRI?

Impact factor 3.6; JCR Q1 (Radiology, Nuclear Medicine & Medical Imaging)

European Radiology Experimental. Pub Date: 2025-01-15. DOI: 10.1186/s41747-024-00547-w

Brendan S Kelly, Sophie Duignan, Prateek Mathur, Henry Dillon, Edward H Lee, Kristen W Yeom, Pearse A Keane, Aonghus Lawlor, Ronan P Killeen

Citation: European Radiology Experimental 9(1):9. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11735712/pdf/

Citations: 0

Abstract

Background: The large language model ChatGPT can now accept image input with the GPT4-vision (GPT4V) version. We aimed to compare the performance of GPT4V to pretrained U-Net and vision transformer (ViT) models for the identification of the progression of multiple sclerosis (MS) on magnetic resonance imaging (MRI).

Methods: Paired coregistered MR images with and without progression were provided as input to ChatGPT4V in a zero-shot experiment to identify radiologic progression. Its performance was compared to pretrained U-Net and ViT models. Accuracy was the primary evaluation metric, and 95% confidence intervals (CIs) were calculated by bootstrapping. We included 170 patients with MS (50 males, 120 females), aged 21-74 years (mean 42.3), imaged at a single institution from 2019 to 2021, each with 2-5 MRI studies (496 in total).
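The abstract reports bootstrapped 95% CIs for accuracy but does not describe the procedure. The following is a minimal sketch of one common approach, the percentile bootstrap, and is not the authors' implementation. The function name `bootstrap_accuracy_ci`, the number of resamples, and the seed are assumptions chosen for illustration, and the labels in the usage note are synthetic.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    Hypothetical helper: resamples the per-case correctness flags with
    replacement and reads the CI bounds off the sorted resampled accuracies.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    rng = random.Random(seed)
    n = len(y_true)
    # 1 where the prediction matches the reference label, else 0.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    point = sum(correct) / n

    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # draw n cases with replacement
        resampled.append(sum(sample) / n)
    resampled.sort()

    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, resampled[lower_idx], resampled[upper_idx]
```

As a usage example with synthetic data, `bootstrap_accuracy_ci([1] * 40, [1] * 34 + [0] * 6)` returns a point accuracy of 0.85 together with lower and upper bounds that depend on the seed and the number of resamples.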

Results: One hundred seventy patients were included, 110 for training, 30 for tuning, and 30 for testing; 100 unseen paired images were randomly selected from the test set for evaluation. Both U-Net and ViT had 94% (95% CI: 89-98%) accuracy, while GPT4V had 85% (77-91%). GPT4V gave cautious non-answers in six cases. GPT4V had precision (specificity), recall (sensitivity), and F1 score of 89% (75-93%), 92% (82-98%), and 91% (82-97%), compared to 100% (100-100%), 88% (78-96%), and 94% (88-98%) for U-Net and 94% (87-100%), 94% (88-100%), and 94% (89-98%) for ViT.
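For readers who want to recompute metrics of this kind, here is a minimal sketch of the standard binary-classification definitions. It is not the authors' code. Under those definitions, precision is TP/(TP+FP) and specificity is TN/(TN+FP), so the sketch reports them separately even though the abstract pairs the two terms. The function name `binary_metrics` and the convention that 1 means "progression" are assumptions. How the six non-answers were scored is not stated in the abstract, so this sketch takes only definite 0/1 predictions.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def binary_metrics(y_true, y_pred):
    """Compute standard metrics for binary labels (1 = progression, 0 = none)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = _ratio(tp, tp + fp)    # positive predictive value
    recall = _ratio(tp, tp + fn)       # sensitivity
    specificity = _ratio(tn, tn + fp)  # true negative rate
    f1 = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(tp + tn, len(pairs))

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two.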

Conclusion: The performance of GPT4V combined with its accessibility suggests it has the potential to impact AI radiology research. However, misclassified cases and overly cautious non-answers confirm that it is not yet ready for clinical use.

Relevance statement: GPT4V can identify the radiologic progression of MS in a simplified experimental setting. However, GPT4V is not a medical device, and its widespread availability highlights the need for caution and education for lay users, especially those with limited access to expert healthcare.

Key points: Without fine-tuning or the need for prior coding experience, GPT4V can perform a zero-shot radiologic change detection task with reasonable accuracy. However, in absolute terms, in a simplified "spot the difference" medical imaging task, GPT4V was inferior to state-of-the-art computer vision methods. GPT4V's performance metrics were more similar to those of the ViT than to those of the U-Net. This is an exploratory experimental study and GPT4V is not intended for use as a medical device.
