Evaluating ChatGPT's performance across radiology subspecialties: A meta-analysis of board-style examination accuracy and variability

Dan Nguyen, Grace Hyun J. Kim, Arash Bedayat
Clinical Imaging, vol. 125, Article 110551 (published 2025-06-20). DOI: 10.1016/j.clinimag.2025.110551

Abstract

Introduction

Large language models (LLMs) such as ChatGPT are increasingly used in medicine because of their ability to synthesize information and support clinical decision-making. While prior research has evaluated ChatGPT's performance on medical board examinations, limited data exist for radiology-specific examinations, particularly with respect to prompting strategies and input modalities. This meta-analysis reviews ChatGPT's performance on radiology board-style questions, assessing accuracy across radiology subspecialties, prompt engineering methods, GPT model versions, and input modalities.

Methods

Searches of PubMed and Scopus identified 163 articles, of which 16 met inclusion criteria after excluding irrelevant topics and non-board-exam evaluations. Extracted data included subspecialty topic, accuracy, question count, GPT model, input modality, prompting strategy, and access dates. Statistical analyses included two-proportion z-tests, a binomial generalized linear model (GLM), and random-effects meta-regression (Stata v18.0; R v4.3.1).
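The abstract names the tests but not the code. As a rough, hypothetical illustration of the kind of computation involved, the Python sketch below runs a two-proportion z-test and a DerSimonian-Laird random-effects pooling of per-study accuracy on made-up counts. The authors' analyses were performed in Stata v18.0 and R v4.3.1; this is not their code, and every count below is a placeholder.

```python
# Illustrative sketch only, not the authors' analysis. All counts are hypothetical;
# the paper pooled 7024 questions from 16 studies.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# --- Two-proportion z-test (e.g., comparing two question groups) ---
correct = np.array([850, 420])   # hypothetical correct-answer counts per group
totals  = np.array([1267, 903])  # hypothetical question counts per group
z, p = proportions_ztest(correct, totals)
print(f"z = {z:.2f}, p = {p:.4f}")

# --- Random-effects pooling of per-study accuracy (DerSimonian-Laird) ---
# Logit-transform each study's accuracy; within-study variance ~ 1/x + 1/(n - x).
x = np.array([120, 95, 310, 64])     # hypothetical correct counts per study
n = np.array([180, 150, 520, 140])   # hypothetical question counts per study
yi = np.log(x / (n - x))             # logit(accuracy)
vi = 1.0 / x + 1.0 / (n - x)

w = 1.0 / vi                          # fixed-effect weights
y_fe = np.sum(w * yi) / np.sum(w)
Q = np.sum(w * (yi - y_fe) ** 2)      # Cochran's Q
df = len(yi) - 1
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)         # DL estimate of between-study variance

w_re = 1.0 / (vi + tau2)              # random-effects weights
y_re = np.sum(w_re * yi) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
lo, hi = y_re - 1.96 * se_re, y_re + 1.96 * se_re
expit = lambda t: 1.0 / (1.0 + np.exp(-t))
print(f"pooled accuracy = {expit(y_re):.3f} "
      f"(95% CI {expit(lo):.3f}-{expit(hi):.3f}), tau^2 = {tau2:.3f}")
```

The published analysis goes further (a binomial GLM and meta-regression with study-level moderators); the sketch only shows the pooling step on the logit scale.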

Results

Across 7024 questions, overall accuracy was 58.83% (95% CI, 55.53–62.13). Performance varied widely by subspecialty, highest in emergency radiology (73.00%) and lowest in musculoskeletal radiology (49.24%). GPT-4 and GPT-4o significantly outperformed GPT-3.5 (p < .001), but visual inputs yielded lower accuracy (46.52%) than textual inputs (67.10%; p < .001). Basic prompting significantly improved accuracy (66.23%) relative to no prompting (59.70%; p < .01). A modest but significant decline in performance over time was also observed (p < .001).
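One way to read the headline interval: a naive binomial confidence interval on 58.83% of 7024 pooled questions would be much narrower than the one reported, which is consistent with the random-effects model accounting for between-study variability rather than pooling raw counts. A minimal check of that arithmetic (not taken from the paper):

```python
# Hedged sanity check, not from the paper: naive pooled-binomial 95% CI for the
# overall accuracy, ignoring between-study heterogeneity.
import math

p_hat, n = 0.5883, 7024
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"naive CI: {100 * lo:.2f}%-{100 * hi:.2f}%")   # ~57.7%-60.0%
# The reported CI (55.53-62.13) is wider, reflecting the random-effects model's
# allowance for heterogeneity across the 16 included studies.
```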

Discussion

ChatGPT demonstrates promising but inconsistent performance on radiology board-style questions. Limitations in visual reasoning, heterogeneity across studies, and variability in prompt engineering highlight areas requiring targeted optimization.