Evaluation of GPT Large Language Model Performance on RSNA 2023 Case of the Day Questions.
IF 12.1 · CAS Tier 1 (Medicine) · JCR Q1 (Radiology, Nuclear Medicine & Medical Imaging)
Pritam Mukherjee, Benjamin Hou, Abhinav Suri, Yan Zhuang, Christopher Parnell, Nicholas Lee, Oana Stroie, Ravi Jain, Kenneth C Wang, Komal Sharma, Ronald M Summers
Radiology, 2024;313(1):e240609. DOI: 10.1148/radiol.240609. Published October 1, 2024. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11535869/pdf/
Citations: 0
Abstract
Background GPT-4V (GPT-4 with vision, ChatGPT; OpenAI) has shown impressive performance in several medical assessments. However, few studies have assessed its performance in interpreting radiologic images. Purpose To assess and compare the accuracy of GPT-4V in assessing radiologic cases with both images and textual context to that of radiologists and residents, to assess if GPT-4V assistance improves human accuracy, and to assess and compare the accuracy of GPT-4V with that of image-only or text-only inputs. Materials and Methods Seventy-two Case of the Day questions at the RSNA 2023 Annual Meeting were curated in this observer study. Answers from GPT-4V were obtained between November 26 and December 10, 2023, with the following inputs for each question: image only, text only, and both text and images. Five radiologists and three residents also answered the questions in an "open book" setting. For the artificial intelligence (AI)-assisted portion, the radiologists and residents were provided with the outputs of GPT-4V. The accuracy of radiologists and residents, both with and without AI assistance, was analyzed using a mixed-effects linear model. The accuracies of GPT-4V with different input combinations were compared by using the McNemar test. P < .05 was considered to indicate a significant difference. Results The accuracy of GPT-4V was 43% (31 of 72; 95% CI: 32, 55). Radiologists and residents did not significantly outperform GPT-4V in either imaging-dependent (59% and 56% vs 39%; P = .31 and .52, respectively) or imaging-independent (76% and 63% vs 70%; both P = .99) cases. With access to GPT-4V responses, there was no evidence of improvement in the average accuracy of the readers. The accuracy obtained by GPT-4V with text-only and image-only inputs was 50% (35 of 70; 95% CI: 39, 61) and 38% (26 of 69; 95% CI: 27, 49), respectively. Conclusion The radiologists and residents did not significantly outperform GPT-4V. 
Assistance from GPT-4V did not help human raters. GPT-4V relied on the textual context for its outputs. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Katz in this issue.
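The abstract reports binomial accuracies with 95% CIs (e.g., 43%, 31 of 72; CI: 32, 55) and pairwise McNemar comparisons between input modes. As a minimal sketch, not the authors' code: the Wilson score interval, one common choice for a binomial proportion, reproduces the reported overall-accuracy interval; the discordant-pair counts in the McNemar example below are hypothetical, since per-case agreement data are not given in the abstract.

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on the discordant counts:
    b = cases one method got right and the other wrong, c = the reverse.
    Under H0, b ~ Binomial(b + c, 0.5); p-value is the doubled lower tail."""
    n, k = b + c, min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Reproduce the reported overall GPT-4V accuracy CI: 31 of 72 correct.
lo, hi = wilson_ci(31, 72)
print(f"{31/72:.0%} (95% CI: {lo:.0%}, {hi:.0%})")  # 43% (95% CI: 32%, 55%)

# Hypothetical discordant counts for a text-only vs image-only comparison.
print(f"McNemar p = {mcnemar_exact(9, 5):.3f}")
```

The McNemar test uses only the discordant pairs because concordant cases (both inputs right, or both wrong) carry no information about which input mode is better.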
Source Journal
Journal Description:
Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant and highest quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies.
Radiology publishes cutting-edge, impactful research in radiology and medical imaging to help improve human health.