Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs.

Impact factor: 12.1 · SCI Zone 1 (Medicine) · JCR Q1, Radiology, Nuclear Medicine & Medical Imaging
Radiology · Publication date: 2024-12-01 · DOI: 10.1148/radiol.241668
Pae Sun Suh, Woo Hyun Shim, Chong Hyun Suh, Hwon Heo, Kye Jin Park, Pyeong Hwa Kim, Se Jin Choi, Yura Ahn, Sohee Park, Ho Young Park, Na Eun Oh, Min Woo Han, Sung Tan Cho, Chang-Yun Woo, Hyungjun Park
{"title":"Comparing Large Language Model and Human Reader Accuracy with <i>New England Journal of Medicine</i> Image Challenge Case Image Inputs.","authors":"Pae Sun Suh, Woo Hyun Shim, Chong Hyun Suh, Hwon Heo, Kye Jin Park, Pyeong Hwa Kim, Se Jin Choi, Yura Ahn, Sohee Park, Ho Young Park, Na Eun Oh, Min Woo Han, Sung Tan Cho, Chang-Yun Woo, Hyungjun Park","doi":"10.1148/radiol.241668","DOIUrl":null,"url":null,"abstract":"<p><p>Background Application of multimodal large language models (LLMs) with both textual and visual capabilities has been steadily increasing, but their ability to interpret radiologic images is still doubted. Purpose To evaluate the accuracy of LLMs and compare it with that of human readers with varying levels of experience and to assess the factors affecting LLM accuracy in answering <i>New England Journal of Medicine</i> Image Challenge cases. Materials and Methods Radiologic images of cases from October 13, 2005, to April 18, 2024, were retrospectively reviewed. Using text and image inputs, LLMs (Open AI's GPT-4 Turbo with Vision [GPT-4V] and GPT-4 Omni [GPT-4o], Google's DeepMind Gemini 1.5 Pro, and Anthropic's Claude 3) provided answers. Human readers (seven junior faculty radiologists, two clinicians, one in-training radiologist, and one medical student), blinded to the published answers, also answered. LLM accuracy with and without image inputs and short (cases from 2005 to 2015) versus long text inputs (from 2016 to 2024) was evaluated in subgroup analysis to determine the effect of these factors. Factor analysis was assessed using multivariable logistic regression. Accuracy was compared with generalized estimating equations, with multiple comparisons adjusted by using Bonferroni correction. Results A total of 272 cases were included. GPT-4o achieved the highest overall accuracy among LLMs (59.6%; 162 of 272), outperforming a medical student (47.1%; 128 of 272; <i>P</i> < .001) but not junior faculty (80.9%; 220 of 272; <i>P</i> < .001) or the in-training radiologist (70.2%; 191 of 272; <i>P</i> = .003). GPT-4o exhibited similar accuracy regardless of image inputs (without images vs with images, 54.0% [147 of 272] vs 59.6% [162 of 272], respectively; <i>P</i> = .59). Human reader accuracy was unaffected by text length, whereas LLMs demonstrated higher accuracy with long text inputs (all <i>P</i> < .001). Text input length affected LLM accuracy (odds ratio range, 3.2 [95% CI: 1.9, 5.5] to 6.6 [95% CI: 3.7, 12.0]). Conclusion LLMs demonstrated substantial accuracy with text and image inputs, outperforming a medical student. However, their accuracy decreased with shorter text lengths, regardless of image input. © RSNA, 2024 <i>Supplemental material is available for this article.</i></p>","PeriodicalId":20896,"journal":{"name":"Radiology","volume":"313 3","pages":"e241668"},"PeriodicalIF":12.1000,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1148/radiol.241668","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Citations: 0

Abstract

Background Application of multimodal large language models (LLMs) with both textual and visual capabilities has been steadily increasing, but their ability to interpret radiologic images remains in question.

Purpose To evaluate the accuracy of LLMs, compare it with that of human readers with varying levels of experience, and assess the factors affecting LLM accuracy in answering New England Journal of Medicine Image Challenge cases.

Materials and Methods Radiologic images of cases from October 13, 2005, to April 18, 2024, were retrospectively reviewed. Using text and image inputs, LLMs (OpenAI's GPT-4 Turbo with Vision [GPT-4V] and GPT-4 Omni [GPT-4o], Google DeepMind's Gemini 1.5 Pro, and Anthropic's Claude 3) provided answers. Human readers (seven junior faculty radiologists, two clinicians, one in-training radiologist, and one medical student), blinded to the published answers, also answered. LLM accuracy with versus without image inputs and with short text inputs (cases from 2005 to 2015) versus long text inputs (cases from 2016 to 2024) was evaluated in subgroup analyses to determine the effect of these factors. Factors affecting accuracy were assessed using multivariable logistic regression. Accuracy was compared with generalized estimating equations, with multiple comparisons adjusted by using Bonferroni correction.

Results A total of 272 cases were included. GPT-4o achieved the highest overall accuracy among the LLMs (59.6%; 162 of 272), outperforming a medical student (47.1%; 128 of 272; P < .001) but scoring lower than junior faculty (80.9%; 220 of 272; P < .001) and the in-training radiologist (70.2%; 191 of 272; P = .003). GPT-4o exhibited similar accuracy regardless of image inputs (without images vs with images, 54.0% [147 of 272] vs 59.6% [162 of 272], respectively; P = .59). Human reader accuracy was unaffected by text length, whereas the LLMs demonstrated higher accuracy with long text inputs (all P < .001). Text input length affected LLM accuracy (odds ratio range, 3.2 [95% CI: 1.9, 5.5] to 6.6 [95% CI: 3.7, 12.0]).

Conclusion LLMs demonstrated substantial accuracy with text and image inputs, outperforming a medical student. However, their accuracy decreased with shorter text inputs, regardless of image input.

© RSNA, 2024. Supplemental material is available for this article.
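The querying setup the abstract describes (text plus image inputs to a multimodal LLM) can be illustrated with a short sketch. The snippet below uses the OpenAI Python SDK to send one case's text and image to GPT-4o; the prompt wording, file handling, and answer-letter format are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch: pose one multiple-choice image case to GPT-4o.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_case(case_text: str, image_path: str, choices: list[str]) -> str:
    """Send one case (text + image) and return the model's answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Hypothetical prompt format; the study's actual prompt may differ.
    prompt = (
        case_text
        + "\n\nChoices:\n"
        + "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
        + "\nAnswer with a single letter."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```

Running the same function with the image argument omitted from the message content would reproduce the text-only condition the study compares against.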

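Likewise, the statistical approach the abstract names (a generalized estimating equation on per-case correctness, with Bonferroni-adjusted pairwise comparisons) can be sketched with statsmodels. The data file, column names, and number of comparisons below are assumptions for illustration, not the authors' analysis code.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Long-format table, one row per (case, reader) pair; 'correct' is 0 or 1.
# 'answers.csv' and its columns (case_id, reader, correct) are hypothetical.
df = pd.read_csv("answers.csv")

# Logistic GEE: reader effect on correctness, with cases as clusters.
model = smf.gee(
    "correct ~ C(reader)",
    groups="case_id",
    data=df,
    family=sm.families.Binomial(),
)
result = model.fit()
print(np.exp(result.params))  # coefficients exponentiated to odds ratios

# Bonferroni adjustment for k pairwise comparisons: p_adj = min(1, k * p).
k = 3  # e.g., GPT-4o vs medical student, in-training radiologist, faculty
p_adjusted = (result.pvalues * k).clip(upper=1)
print(p_adjusted)
```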
Source journal
Radiology (Medicine - Nuclear Medicine)
CiteScore: 35.20
Self-citation rate: 3.00%
Articles per year: 596
Review time: 3.6 months
About the journal: Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant, and highest-quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies. Radiology publishes cutting-edge and impactful imaging research articles in radiology and medical imaging to help improve human health.