Can ChatGPT and Gemini justify brain CT referrals? A comparative study with human experts and a custom prediction model.

Impact factor: 3.6; JCR Q1 (Radiology, Nuclear Medicine & Medical Imaging)

European Radiology Experimental. Pub Date: 2025-02-18. DOI: 10.1186/s41747-025-00569-y

Jaka Potočnik, Edel Thomas, Dearbhla Kearney, Ronan P Killeen, Eric J Heffernan, Shane J Foley

Volume 9, issue 1, article 24. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11836243/pdf/


Abstract

Background: The poor uptake of imaging referral guidelines in Europe results in a substantial number of inappropriate computed tomography (CT) scans. Publicly available chatbots, ChatGPT and Gemini, offer an alternative for justifying real-world referrals. Recent research reports high ChatGPT accuracy when analysing American College of Radiology Appropriateness Criteria variants. We compared the chatbots' performance in interpreting, justifying, and suggesting alternative imaging for unstructured adult brain CT referrals in accordance with the European Society of Radiology iGuide. We also compared our prediction model for automated iGuide categorisation of referrals against the chatbots.

Methods: The iGuide justification of 143 real-world CT brain referrals, used to evaluate a prediction model, was analysed by two radiographers and radiologists. ChatGPT-4's and Gemini's imaging recommendations and pathology suspicions were compared with those of humans, with respect to referral completeness. Inter-rater reliability with κ statistics determined the agreement between entities.
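
The agreement measure named in the Methods can be illustrated with a minimal sketch. The abstract only says "κ statistics", so the choice of Cohen's κ for two raters is an assumption here, and the function name `cohen_kappa`, the category labels, and the six example ratings are all hypothetical illustrations, not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: the abstract does not say which kappa variant was used;
    this is the standard two-rater form, shown only for illustration.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (not from the study): a human expert vs. a chatbot.
expert = ["justified", "justified", "unjustified",
          "justified", "unjustified", "unjustified"]
chatbot = ["justified", "unjustified", "unjustified",
           "justified", "unjustified", "justified"]
kappa = cohen_kappa(expert, chatbot)
print(f"kappa = {kappa:.3f}")
```

In this toy example the raters agree on 4 of 6 items, but half of that agreement is expected by chance, which is why κ is much lower than the raw agreement rate.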

Results: Chatbots' performance was limited (κ = 0.3) but improved for more complete referrals. The prediction model outperformed the chatbots in justification analysis (κ = 0.853). The chatbots' interpretations of complete referrals were highly consistent (49/52, 94.2%). The agreement regarding alternative imaging was high for both complete and ambiguous referrals, with ChatGPT and Gemini correctly identifying imaging modality and anatomical region in 83/96 (86.5%) and 81/96 (84.4%) cases, respectively.
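
As a quick arithmetic check, the percentages quoted in the Results follow directly from the reported counts. The helper name `percent` is a hypothetical convenience, and only the counts stated in the abstract are used.

```python
def percent(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


# Counts taken from the Results paragraph of the abstract.
consistent_interpretations = percent(49, 52)  # complete referrals
chatgpt_alternative = percent(83, 96)         # ChatGPT, modality and region
gemini_alternative = percent(81, 96)          # Gemini, modality and region

print(consistent_interpretations, chatgpt_alternative, gemini_alternative)
```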

Conclusion: The chatbots' ability to analyse the justification of adult brain CT referrals is limited to complete referrals, unlike our prediction model. Further research is needed to confirm these findings for other types of CT scans and modalities.

Relevance statement: ChatGPT and Gemini exhibit potential in justifying free-text brain CT referrals; however, further improvements are required to handle real-world referrals of varying quality.

Key points: The custom prediction model's justification analysis strongly aligns with iGuide and surpasses the chatbots. Chatbots incorrectly justified almost one-half of all CT brain referrals. Chatbots have limited performance in justifying ambiguous CT brain referrals. Chatbot performance improved when referrals were detailed and included suspected pathology.
