Assessing the Capability of ChatGPT, Google Bard, and Microsoft Bing in Solving Radiology Case Vignettes

Impact factor: 1.0; JCR quartile: Q4 (Radiology, Nuclear Medicine & Medical Imaging)

Indian Journal of Radiology and Imaging. Pub Date: 2023-12-29. DOI: 10.1055/s-0043-1777746

Pradosh Kumar Sarangi, Ravi Kant Narayan, S. Mohakud, Aditi Vats, Debabrata Sahani, Himel Mondal

Abstract

Background: The field of radiology relies on accurate interpretation of medical images for effective diagnosis and patient care. Recent advancements in artificial intelligence (AI) and natural language processing have sparked interest in exploring the potential of AI models to assist radiologists. However, little research has assessed the performance of AI models in radiology case interpretation, particularly in comparison with human experts.

Objective: This study aimed to evaluate the performance of ChatGPT, Google Bard, and Bing in solving radiology case vignettes (Fellowship of the Royal College of Radiologists 2A [FRCR2A] examination-style questions) by comparing their responses with those provided by two radiology residents.

Methods: A total of 120 multiple-choice questions based on radiology case vignettes were formulated according to the pattern of the FRCR2A examination. The questions were presented to ChatGPT, Google Bard, and Bing. Two residents sat an examination with the same questions in 3 hours. The responses generated by the AI models were collected and compared with the answer key, and the explanations accompanying the answers were rated by two radiologists. A cutoff of 60% was set as the passing score.

Results: The two residents (63.33% and 57.5%) outperformed the three AI models: Bard (44.17%), Bing (53.33%), and ChatGPT (45%), but only one resident passed the examination. The response patterns among the five respondents were significantly different (p = 0.0117). In addition, the agreement among the generative AI models was significant (intraclass correlation coefficient [ICC] = 0.628), but there was no agreement between the residents (Kappa = -0.376). The explanations that the generative AI models offered in support of their answers were 44.72% accurate.

Conclusion: Humans exhibited superior accuracy compared with the AI models, showcasing a stronger comprehension of the subject matter. None of the three AI models achieved the minimum percentage needed to pass an FRCR2A examination. However, the generative AI models showed significant agreement in their answers, whereas the residents showed low agreement with each other, highlighting a lack of consistency in the residents' responses.
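The reported figures can be checked with a little arithmetic. The sketch below is illustrative only and is not the authors' analysis code. It converts the percentages given in the abstract into implied counts of correct answers out of 120 and applies the 60% cutoff. The labels "Resident 1" and "Resident 2" and the helper names are assumptions introduced here, and the counts are inferred from the percentages rather than stated in the abstract. The `cohen_kappa` helper shows how a negative kappa can arise, using made-up toy answers; whether the study computed kappa on the chosen options or on correct/incorrect outcomes is not stated, so that part is purely a demonstration of the statistic.

```python
from collections import Counter

TOTAL_QUESTIONS = 120   # from the abstract
PASS_CUTOFF = 0.60      # from the abstract

# Percentages as reported in the abstract; the respondent labels are hypothetical.
reported_pct = {
    "Resident 1": 63.33,
    "Resident 2": 57.5,
    "Bing": 53.33,
    "ChatGPT": 45.0,
    "Bard": 44.17,
}


def correct_count(pct, total=TOTAL_QUESTIONS):
    """Implied number of correct answers (inferred, not reported)."""
    return round(pct / 100 * total)


def passed(pct, cutoff=PASS_CUTOFF):
    """True if the score meets the passing cutoff."""
    return pct / 100 >= cutoff


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items.

    kappa = (observed - expected) / (1 - expected), where 'expected' is the
    agreement predicted by chance from each rater's answer frequencies.
    Undefined (division by zero) if both raters always give one identical label.
    """
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    for name, pct in reported_pct.items():
        status = "pass" if passed(pct) else "fail"
        print(f"{name}: {correct_count(pct)}/{TOTAL_QUESTIONS} ({pct}%) {status}")

    # Toy data only: two raters who systematically disagree.
    toy_a = ["A", "B", "A", "B"]
    toy_b = ["B", "A", "B", "A"]
    print("toy kappa:", cohen_kappa(toy_a, toy_b))
```

Under these assumptions the 60% cutoff corresponds to 72 correct answers out of 120, so only the 63.33% score clears it. A kappa below zero, as in the toy example, means the two raters agreed less often than chance alone would predict.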
