Brendan S Kelly, Sophie Duignan, Prateek Mathur, Henry Dillon, Edward H Lee, Kristen W Yeom, Pearse A Keane, Aonghus Lawlor, Ronan P Killeen
{"title":"Can ChatGPT4-vision identify radiologic progression of multiple sclerosis on brain MRI?","authors":"Brendan S Kelly, Sophie Duignan, Prateek Mathur, Henry Dillon, Edward H Lee, Kristen W Yeom, Pearse A Keane, Aonghus Lawlor, Ronan P Killeen","doi":"10.1186/s41747-024-00547-w","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The large language model ChatGPT can now accept image input with the GPT4-vision (GPT4V) version. We aimed to compare the performance of GPT4V to pretrained U-Net and vision transformer (ViT) models for the identification of the progression of multiple sclerosis (MS) on magnetic resonance imaging (MRI).</p><p><strong>Methods: </strong>Paired coregistered MR images with and without progression were provided as input to ChatGPT4V in a zero-shot experiment to identify radiologic progression. Its performance was compared to pretrained U-Net and ViT models. Accuracy was the primary evaluation metric and 95% confidence interval (CIs) were calculated by bootstrapping. We included 170 patients with MS (50 males, 120 females), aged 21-74 years (mean 42.3), imaged at a single institution from 2019 to 2021, each with 2-5 MRI studies (496 in total).</p><p><strong>Results: </strong>One hundred seventy patients were included, 110 for training, 30 for tuning, and 30 for testing; 100 unseen paired images were randomly selected from the test set for evaluation. Both U-Net and ViT had 94% (95% CI: 89-98%) accuracy while GPT4V had 85% (77-91%). GPT4V gave cautious nonanswers in six cases. GPT4V had precision (specificity), recall (sensitivity), and F1 score of 89% (75-93%), 92% (82-98%), 91 (82-97%) compared to 100% (100-100%), 88 (78-96%), and 0.94 (88-98%) for U-Net and 94% (87-100%), 94 (88-100%), and 94 (89-98%) for ViT.</p><p><strong>Conclusion: </strong>The performance of GPT4V combined with its accessibility suggests has the potential to impact AI radiology research. However, misclassified cases and overly cautious non-answers confirm that it is not yet ready for clinical use.</p><p><strong>Relevance statement: </strong>GPT4V can identify the radiologic progression of MS in a simplified experimental setting. However, GPT4V is not a medical device, and its widespread availability highlights the need for caution and education for lay users, especially those with limited access to expert healthcare.</p><p><strong>Key points: </strong>Without fine-tuning or the need for prior coding experience, GPT4V can perform a zero-shot radiologic change detection task with reasonable accuracy. However, in absolute terms, in a simplified \"spot the difference\" medical imaging task, GPT4V was inferior to state-of-the-art computer vision methods. GPT4V's performance metrics were more similar to the ViT than the U-net. This is an exploratory experimental study and GPT4V is not intended for use as a medical device.</p>","PeriodicalId":36926,"journal":{"name":"European Radiology Experimental","volume":"9 1","pages":"9"},"PeriodicalIF":3.7000,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11735712/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Radiology Experimental","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s41747-024-00547-w","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
引用次数: 0
Abstract
Background: The large language model ChatGPT can now accept image input with the GPT4-vision (GPT4V) version. We aimed to compare the performance of GPT4V to pretrained U-Net and vision transformer (ViT) models for the identification of the progression of multiple sclerosis (MS) on magnetic resonance imaging (MRI).
Methods: Paired coregistered MR images with and without progression were provided as input to ChatGPT4V in a zero-shot experiment to identify radiologic progression. Its performance was compared to pretrained U-Net and ViT models. Accuracy was the primary evaluation metric and 95% confidence interval (CIs) were calculated by bootstrapping. We included 170 patients with MS (50 males, 120 females), aged 21-74 years (mean 42.3), imaged at a single institution from 2019 to 2021, each with 2-5 MRI studies (496 in total).
Results: One hundred seventy patients were included, 110 for training, 30 for tuning, and 30 for testing; 100 unseen paired images were randomly selected from the test set for evaluation. Both U-Net and ViT had 94% (95% CI: 89-98%) accuracy while GPT4V had 85% (77-91%). GPT4V gave cautious nonanswers in six cases. GPT4V had precision (specificity), recall (sensitivity), and F1 score of 89% (75-93%), 92% (82-98%), 91 (82-97%) compared to 100% (100-100%), 88 (78-96%), and 0.94 (88-98%) for U-Net and 94% (87-100%), 94 (88-100%), and 94 (89-98%) for ViT.
Conclusion: The performance of GPT4V combined with its accessibility suggests has the potential to impact AI radiology research. However, misclassified cases and overly cautious non-answers confirm that it is not yet ready for clinical use.
Relevance statement: GPT4V can identify the radiologic progression of MS in a simplified experimental setting. However, GPT4V is not a medical device, and its widespread availability highlights the need for caution and education for lay users, especially those with limited access to expert healthcare.
Key points: Without fine-tuning or the need for prior coding experience, GPT4V can perform a zero-shot radiologic change detection task with reasonable accuracy. However, in absolute terms, in a simplified "spot the difference" medical imaging task, GPT4V was inferior to state-of-the-art computer vision methods. GPT4V's performance metrics were more similar to the ViT than the U-net. This is an exploratory experimental study and GPT4V is not intended for use as a medical device.