Assessing the Capability of ChatGPT, Google Bard, and Microsoft Bing in Solving Radiology Case Vignettes

IF 0.9 · Q4 · Radiology, Nuclear Medicine & Medical Imaging
Pradosh Kumar Sarangi, Ravi Kant Narayan, S. Mohakud, Aditi Vats, Debabrata Sahani, Himel Mondal
{"title":"Assessing the Capability of ChatGPT, Google Bard, and Microsoft Bing in Solving Radiology Case Vignettes","authors":"Pradosh Kumar Sarangi, Ravi Kant Narayan, S. Mohakud, Aditi Vats, Debabrata Sahani, Himel Mondal","doi":"10.1055/s-0043-1777746","DOIUrl":null,"url":null,"abstract":"Abstract Background  The field of radiology relies on accurate interpretation of medical images for effective diagnosis and patient care. Recent advancements in artificial intelligence (AI) and natural language processing have sparked interest in exploring the potential of AI models in assisting radiologists. However, limited research has been conducted to assess the performance of AI models in radiology case interpretation, particularly in comparison to human experts. Objective  This study aimed to evaluate the performance of ChatGPT, Google Bard, and Bing in solving radiology case vignettes (Fellowship of the Royal College of Radiologists 2A [FRCR2A] examination style questions) by comparing their responses to those provided by two radiology residents. Methods  A total of 120 multiple-choice questions based on radiology case vignettes were formulated according to the pattern of FRCR2A examination. The questions were presented to ChatGPT, Google Bard, and Bing. Two residents wrote the examination with the same questions in 3 hours. The responses generated by the AI models were collected and compared to the answer keys and explanation of the answers was rated by the two radiologists. A cutoff of 60% was set as the passing score. Results  The two residents (63.33 and 57.5%) outperformed the three AI models: Bard (44.17%), Bing (53.33%), and ChatGPT (45%), but only one resident passed the examination. The response patterns among the five respondents were significantly different ( p  = 0.0117). In addition, the agreement among the generative AI models was significant (intraclass correlation coefficient [ICC] = 0.628), but there was no agreement between the residents (Kappa = –0.376). The explanation of generative AI models in support of answer was 44.72% accurate. Conclusion  Humans exhibited superior accuracy compared to the AI models, showcasing a stronger comprehension of the subject matter. All three AI models included in the study could not achieve the minimum percentage needed to pass an FRCR2A examination. However, generative AI models showed significant agreement in their answers where the residents exhibited low agreement, highlighting a lack of consistency in their responses.","PeriodicalId":51597,"journal":{"name":"Indian Journal of Radiology and Imaging","volume":null,"pages":null},"PeriodicalIF":0.9000,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Indian Journal of Radiology and Imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1055/s-0043-1777746","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
引用次数: 0

Abstract

Background: The field of radiology relies on accurate interpretation of medical images for effective diagnosis and patient care. Recent advances in artificial intelligence (AI) and natural language processing have sparked interest in the potential of AI models to assist radiologists. However, limited research has assessed the performance of AI models in radiology case interpretation, particularly in comparison with human experts.

Objective: This study aimed to evaluate the performance of ChatGPT, Google Bard, and Bing in solving radiology case vignettes (Fellowship of the Royal College of Radiologists 2A [FRCR2A] examination-style questions) by comparing their responses with those provided by two radiology residents.

Methods: A total of 120 multiple-choice questions based on radiology case vignettes were formulated according to the pattern of the FRCR2A examination and presented to ChatGPT, Google Bard, and Bing. Two residents answered the same questions within 3 hours. The responses generated by the AI models were compared with the answer key, and the explanations given in support of each answer were rated by the two radiologists. A cutoff of 60% was set as the passing score.

Results: The two residents (63.33 and 57.5%) outperformed the three AI models: Bard (44.17%), Bing (53.33%), and ChatGPT (45%), but only one resident passed the examination. The response patterns among the five respondents differed significantly (p = 0.0117). In addition, the agreement among the generative AI models was significant (intraclass correlation coefficient [ICC] = 0.628), whereas there was no agreement between the residents (kappa = -0.376). The explanations offered by the generative AI models in support of their answers were 44.72% accurate.

Conclusion: The residents exhibited superior accuracy compared with the AI models, demonstrating a stronger comprehension of the subject matter. None of the three AI models achieved the minimum score needed to pass an FRCR2A examination. However, the generative AI models showed significant agreement in their answers, whereas the residents exhibited low agreement, highlighting a lack of consistency in the residents' responses.
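The scoring and agreement analysis described above is straightforward to reproduce in principle. The sketch below is a minimal illustration only, assuming hypothetical answer-key and response data rather than the study's actual 120 items: it computes each respondent's percentage score against the key, applies the 60% passing cutoff, estimates chance-corrected agreement between the two residents with Cohen's kappa, and compares correct/incorrect counts across respondents with a chi-square test. The authors' actual analysis, including the ICC among the three models, may have used different tools.

```python
# Illustrative sketch only; the answer key and responses are hypothetical
# placeholders, not the study's data (which comprised 120 FRCR2A-style items).
from sklearn.metrics import cohen_kappa_score
from scipy.stats import chi2_contingency

answer_key = ["A", "C", "B", "D", "A"]

responses = {
    "resident_1": ["A", "C", "D", "D", "A"],
    "resident_2": ["A", "B", "D", "D", "C"],
    "chatgpt":    ["B", "C", "A", "A", "C"],  # the study also scored Bard and Bing
}

PASS_MARK = 60.0  # passing cutoff used in the study


def percent_correct(answers, key):
    """Percentage of responses matching the answer key."""
    return 100.0 * sum(a == k for a, k in zip(answers, key)) / len(key)


for name, answers in responses.items():
    score = percent_correct(answers, answer_key)
    verdict = "pass" if score >= PASS_MARK else "fail"
    print(f"{name}: {score:.2f}% ({verdict})")

# Chance-corrected agreement between the two residents (Cohen's kappa)
kappa = cohen_kappa_score(responses["resident_1"], responses["resident_2"])
print(f"Resident-resident kappa: {kappa:.3f}")

# Chi-square test on correct/incorrect counts, analogous to comparing
# response patterns across respondents
table = [
    [sum(a == k for a, k in zip(ans, answer_key)),
     sum(a != k for a, k in zip(ans, answer_key))]
    for ans in responses.values()
]
chi2, p, _, _ = chi2_contingency(table)
print(f"Chi-square p-value across respondents: {p:.4f}")
```

The ICC among the three models' item-level scores could be obtained analogously with a dedicated package (for example, pingouin's intraclass_corr); it is omitted here to keep the sketch minimal.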
Source Journal

Indian Journal of Radiology and Imaging (Radiology, Nuclear Medicine & Medical Imaging)

CiteScore: 1.20 · Self-citation rate: 0.00% · Annual articles: 115 · Review time: 45 weeks