{"title":"评估ChatGPT在放射学亚专科的表现:委员会式检查准确性和可变性的荟萃分析","authors":"Dan Nguyen , Grace Hyun J. Kim , Arash Bedayat","doi":"10.1016/j.clinimag.2025.110551","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction</h3><div>Large language models (LLMs) like ChatGPT are increasingly used in medicine due to their ability to synthesize information and support clinical decision-making. While prior research has evaluated ChatGPT's performance on medical board exams, limited data exist on radiology-specific exams especially considering prompt strategies and input modalities. This meta-analysis reviews ChatGPT's performance on radiology board-style questions, assessing accuracy across radiology subspecialties, prompt engineering methods, GPT model versions, and input modalities.</div></div><div><h3>Methods</h3><div>Searches in PubMed and SCOPUS identified 163 articles, of which 16 met inclusion criteria after excluding irrelevant topics and non-board exam evaluations. Data extracted included subspecialty topics, accuracy, question count, GPT model, input modality, prompting strategies, and access dates. Statistical analyses included two-proportion z-tests, a binomial generalized linear model (GLM), and meta-regression with random effects (Stata v18.0, R v4.3.1).</div></div><div><h3>Results</h3><div>Across 7024 questions, overall accuracy was 58.83 % (95 % CI, 55.53–62.13). Performance varied widely by subspecialty, highest in emergency radiology (73.00 %) and lowest in musculoskeletal radiology (49.24 %). GPT-4 and GPT-4o significantly outperformed GPT-3.5 (<em>p</em> < .001), but visual inputs yielded lower accuracy (46.52 %) compared to textual inputs (67.10 %, <em>p</em> < .001). Prompting strategies showed significant improvement (<em>p</em> < .01) with basic prompts (66.23 %) compared to no prompts (59.70 %). A modest but significant decline in performance over time was also observed (<em>p</em> < .001).</div></div><div><h3>Discussion</h3><div>ChatGPT demonstrates promising but inconsistent performance in radiology board-style questions. Limitations in visual reasoning, heterogeneity across studies, and prompt engineering variability highlight areas requiring targeted optimization.</div></div>","PeriodicalId":50680,"journal":{"name":"Clinical Imaging","volume":"125 ","pages":"Article 110551"},"PeriodicalIF":1.5000,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating ChatGPT's performance across radiology subspecialties: A meta-analysis of board-style examination accuracy and variability\",\"authors\":\"Dan Nguyen , Grace Hyun J. Kim , Arash Bedayat\",\"doi\":\"10.1016/j.clinimag.2025.110551\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Introduction</h3><div>Large language models (LLMs) like ChatGPT are increasingly used in medicine due to their ability to synthesize information and support clinical decision-making. While prior research has evaluated ChatGPT's performance on medical board exams, limited data exist on radiology-specific exams especially considering prompt strategies and input modalities. 
This meta-analysis reviews ChatGPT's performance on radiology board-style questions, assessing accuracy across radiology subspecialties, prompt engineering methods, GPT model versions, and input modalities.</div></div><div><h3>Methods</h3><div>Searches in PubMed and SCOPUS identified 163 articles, of which 16 met inclusion criteria after excluding irrelevant topics and non-board exam evaluations. Data extracted included subspecialty topics, accuracy, question count, GPT model, input modality, prompting strategies, and access dates. Statistical analyses included two-proportion z-tests, a binomial generalized linear model (GLM), and meta-regression with random effects (Stata v18.0, R v4.3.1).</div></div><div><h3>Results</h3><div>Across 7024 questions, overall accuracy was 58.83 % (95 % CI, 55.53–62.13). Performance varied widely by subspecialty, highest in emergency radiology (73.00 %) and lowest in musculoskeletal radiology (49.24 %). GPT-4 and GPT-4o significantly outperformed GPT-3.5 (<em>p</em> < .001), but visual inputs yielded lower accuracy (46.52 %) compared to textual inputs (67.10 %, <em>p</em> < .001). Prompting strategies showed significant improvement (<em>p</em> < .01) with basic prompts (66.23 %) compared to no prompts (59.70 %). A modest but significant decline in performance over time was also observed (<em>p</em> < .001).</div></div><div><h3>Discussion</h3><div>ChatGPT demonstrates promising but inconsistent performance in radiology board-style questions. Limitations in visual reasoning, heterogeneity across studies, and prompt engineering variability highlight areas requiring targeted optimization.</div></div>\",\"PeriodicalId\":50680,\"journal\":{\"name\":\"Clinical Imaging\",\"volume\":\"125 \",\"pages\":\"Article 110551\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2025-06-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Clinical Imaging\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0899707125001512\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Imaging","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0899707125001512","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Evaluating ChatGPT's performance across radiology subspecialties: A meta-analysis of board-style examination accuracy and variability
Introduction
Large language models (LLMs) such as ChatGPT are increasingly used in medicine because of their ability to synthesize information and support clinical decision-making. While prior research has evaluated ChatGPT's performance on medical board examinations, limited data exist on radiology-specific examinations, particularly with respect to prompting strategies and input modalities. This meta-analysis reviews ChatGPT's performance on radiology board-style questions, assessing accuracy across radiology subspecialties, prompt engineering methods, GPT model versions, and input modalities.
Methods
Searches of PubMed and SCOPUS identified 163 articles, of which 16 met inclusion criteria after exclusion of irrelevant topics and non-board-style exam evaluations. Extracted data included subspecialty topic, accuracy, question count, GPT model version, input modality, prompting strategy, and access date. Statistical analyses included two-proportion z-tests, a binomial generalized linear model (GLM), and random-effects meta-regression (Stata v18.0; R v4.3.1).
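The abstract does not include the authors' analysis code; the following is a minimal, illustrative sketch of the kind of pooling described above, assuming per-study counts of correct answers (`correct`) and total questions (`n_questions`) have already been extracted. It pools logit-transformed study-level accuracies with a DerSimonian–Laird random-effects estimator and shows a two-proportion z-test; all variable names and counts are hypothetical and do not reproduce the published results.

```python
# Minimal sketch of random-effects pooling of study-level accuracies
# (DerSimonian-Laird on the logit scale). Illustrative only; counts are hypothetical.
import numpy as np
from scipy.stats import norm

# Hypothetical per-study data: correctly answered questions and totals.
correct = np.array([120, 88, 150, 60])
n_questions = np.array([200, 160, 220, 130])

# Logit-transformed accuracy and its approximate within-study variance.
p = correct / n_questions
yi = np.log(p / (1 - p))
vi = 1 / correct + 1 / (n_questions - correct)

# DerSimonian-Laird estimate of between-study variance (tau^2).
w = 1 / vi
y_fixed = np.sum(w * yi) / np.sum(w)
Q = np.sum(w * (yi - y_fixed) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(yi) - 1)) / c)

# Random-effects pooled logit accuracy and 95% CI, back-transformed to a proportion.
w_re = 1 / (vi + tau2)
mu = np.sum(w_re * yi) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
ci = mu + np.array([-1.96, 1.96]) * se
pooled, lo, hi = 1 / (1 + np.exp(-np.array([mu, *ci])))
print(f"pooled accuracy ~ {pooled:.3f} (95% CI {lo:.3f}-{hi:.3f})")

# Two-proportion z-test, e.g., text-only vs. image-based questions (hypothetical counts).
x1, n1, x2, n2 = 900, 1340, 310, 670
p1, p2, pp = x1 / n1, x2 / n2, (x1 + x2) / (n1 + n2)
z = (p1 - p2) / np.sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
print(f"z = {z:.2f}, two-sided p = {2 * norm.sf(abs(z)):.4g}")
```

The logit scale is used in this sketch because confidence intervals pooled on the raw proportion scale can fall outside [0, 1]; the published analysis may have used a different transformation or Stata's built-in meta-regression routines.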
Results
Across 7024 questions, overall accuracy was 58.83% (95% CI, 55.53–62.13%). Performance varied widely by subspecialty, from a high in emergency radiology (73.00%) to a low in musculoskeletal radiology (49.24%). GPT-4 and GPT-4o significantly outperformed GPT-3.5 (p < .001), but visual inputs yielded lower accuracy (46.52%) than textual inputs (67.10%; p < .001). Basic prompting significantly improved accuracy (66.23%) compared with no prompting (59.70%; p < .01). A modest but statistically significant decline in performance over time was also observed (p < .001).
Discussion
ChatGPT demonstrates promising but inconsistent performance in radiology board-style questions. Limitations in visual reasoning, heterogeneity across studies, and prompt engineering variability highlight areas requiring targeted optimization.
About the journal:
The mission of Clinical Imaging is to publish, in a timely manner, the very best radiology research from the United States and around the world with special attention to the impact of medical imaging on patient care. The journal's publications cover all imaging modalities, radiology issues related to patients, policy and practice improvements, and clinically-oriented imaging physics and informatics. The journal is a valuable resource for practicing radiologists, radiologists-in-training and other clinicians with an interest in imaging. Papers are carefully peer-reviewed and selected by our experienced subject editors who are leading experts spanning the range of imaging sub-specialties, which include:
- Body Imaging
- Breast Imaging
- Cardiothoracic Imaging
- Imaging Physics and Informatics
- Molecular Imaging and Nuclear Medicine
- Musculoskeletal and Emergency Imaging
- Neuroradiology
- Practice, Policy & Education
- Pediatric Imaging
- Vascular and Interventional Radiology