Large language models in radiology: Fluctuating performance and decreasing discordance over time

Mitul Gupta, John Virostko, Christopher Kaufmann

European Journal of Radiology, Volume 182, Article 111842 (2024). DOI: 10.1016/j.ejrad.2024.111842
Abstract
Objective
Since the introduction of large language models (LLMs), near expert-level performance has been demonstrated in medical specialties such as radiology. However, there is little to no comparative information on model performance, accuracy, and reliability over time in these medical specialty domains. This study aims to evaluate and monitor the performance and internal reliability of LLMs in radiology over a three-month period.
Methods
LLMs (GPT-4, GPT-3.5, Claude, and Google Bard) were queried monthly from November 2023 to January 2024 using ACR Diagnostic In-Training Exam (DXIT) practice questions. Overall model accuracy and accuracy by subspecialty category were assessed over time. Internal consistency was evaluated through answer mismatch, or intra-model discordance, between trials.
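The abstract does not describe the scoring pipeline in detail; the following is a minimal sketch of how overall and per-subspecialty accuracy could be tallied for one model in one monthly trial, assuming each response is stored as a per-question record. The Response structure, field names, and the accuracy function are hypothetical, not taken from the study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Response:
    model: str          # e.g. "GPT-4" (hypothetical record layout)
    month: str          # e.g. "2023-11"
    question_id: int
    subspecialty: str   # e.g. "Chest", "Physics", "Ultrasound"
    answer: str         # option chosen by the model, e.g. "B"
    correct: str        # answer key for the question


def accuracy(responses):
    """Return (overall accuracy, accuracy by subspecialty) for one trial."""
    overall = []
    by_section = defaultdict(list)
    for r in responses:
        hit = int(r.answer == r.correct)
        overall.append(hit)
        by_section[r.subspecialty].append(hit)
    return (
        sum(overall) / len(overall),
        {s: sum(v) / len(v) for s, v in by_section.items()},
    )
```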
Results
GPT-4 had the highest accuracy (78 ± 4.1 %), followed by Google Bard (73 ± 2.9 %), Claude (71 ± 1.5 %), and GPT-3.5 (63 ± 6.9 %). GPT-4 performed significantly better than GPT-3.5 (p = 0.031). Over time, GPT-4’s accuracy trended down (82 % to 74 %), while Claude’s accuracy increased (70 % to 73 %). Intra-model discordance rates decreased for all models, indicating improved response consistency. Performance varied by subspecialty, with significant differences in the Chest, Physics, Ultrasound, and Pediatrics sections. Models struggled with questions requiring detailed factual knowledge but performed better on broader interpretive questions.
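For illustration, an intra-model discordance rate can be computed as the fraction of questions whose chosen answer changes between two monthly trials, independent of correctness. The sketch below assumes answers are keyed by question ID; the function and the example values are hypothetical and not the authors' implementation.

```python
def discordance_rate(trial_a, trial_b):
    """trial_a, trial_b: dicts mapping question_id -> chosen answer letter."""
    shared = trial_a.keys() & trial_b.keys()
    if not shared:
        return 0.0
    mismatches = sum(trial_a[q] != trial_b[q] for q in shared)
    return mismatches / len(shared)


# Toy example: 2 of 10 answers changed between two monthly runs -> 0.2
nov = dict(enumerate("ABCDABCDAB"))
dec = dict(enumerate("ABCDABCDCC"))
print(discordance_rate(nov, dec))  # 0.2
```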
Conclusion
All LLMs except GPT-3.5 scored above 70 %, demonstrating substantial subject-specific knowledge. However, performance fluctuated over time, underscoring the need for continuous, radiology-specific standardized benchmarking metrics to gauge LLM reliability before clinical use. This study provides a foundational benchmark for future LLM performance evaluations in radiology.
Journal Introduction
European Journal of Radiology is an international journal that aims to communicate state-of-the-art information on imaging developments to its readers in the form of high-quality original research articles and timely reviews of current developments in the field.
Its audience includes clinicians at all levels of training, including radiology trainees, newly qualified imaging specialists, and experienced radiologists. Its aim is to inform efficient, appropriate, and evidence-based imaging practice to the benefit of patients worldwide.