Comparison of CT referral justification using clinical decision support and large language models in a large European cohort.

IF 4.7 | Zone 2, Medicine | Q1 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

European Radiology Pub Date : 2025-10-01 Epub Date: 2025-04-27 DOI:10.1007/s00330-025-11608-y

Mor Saban, Yaniv Alon, Osnat Luxenburg, Clara Singer, Monika Hierath, Alexandra Karoussou Schreiner, Boris Brkljačić, Jacob Sosna

Pages: 6150-6159. Publication type: Journal Article. PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12417242/pdf/

Citations: 0

Abstract

Background: Ensuring appropriate use of CT scans is critical for patient safety and resource optimization. Decision support tools and artificial intelligence (AI), such as large language models (LLMs), have the potential to improve CT referral justification, yet require rigorous evaluation against established standards and expert assessments.

Aim: To evaluate the performance of LLMs (Generative Pre-trained Transformer 4 (GPT-4) and Claude-3 Haiku) and independent experts in justifying CT referrals, using the ESR iGuide clinical decision support system as the reference standard.

Methods: CT referral data from 6356 patients were retrospectively analyzed. Recommendations were generated by the ESR iGuide, LLMs, and independent experts, and evaluated for accuracy, precision, recall, F1 score, and Cohen's kappa across medical test, organ, and contrast predictions. Statistical analysis included demographic stratification, confidence intervals, and p-values to ensure robust comparisons.
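The agreement metrics named in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the function names and the eight toy referral labels are hypothetical, and the reference list simply stands in for the ESR iGuide recommendation.

```python
from collections import Counter

def accuracy(reference, predicted):
    """Fraction of referrals where the prediction matches the reference."""
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)

def precision_recall_f1(reference, predicted, positive):
    """Precision, recall, and F1 with one label treated as the positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def cohen_kappa(reference, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(reference)
    observed = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement from the two marginal label distributions.
    expected = sum(ref_counts[k] * pred_counts[k] for k in ref_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both sides use a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical toy labels (not study data): reference standard vs. one rater.
reference = ["CT", "CT", "CT", "MRI", "MRI", "US", "CT", "MRI"]
predicted = ["CT", "CT", "MRI", "MRI", "MRI", "US", "CT", "CT"]

print("accuracy:", accuracy(reference, predicted))
print("precision/recall/F1 for CT:", precision_recall_f1(reference, predicted, "CT"))
print("kappa:", cohen_kappa(reference, predicted))
```

Reporting kappa alongside accuracy matters because accuracy alone does not discount the agreement that two raters would reach by chance.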

Results: Independent experts achieved the highest accuracy (92.4%) for medical test justification, surpassing GPT-4 (88.8%) and Claude-3 Haiku (85.2%). For organ predictions, LLMs performed comparably to experts, achieving accuracies of 75.3-77.8% versus 82.6%. For contrast predictions, GPT-4 showed the highest accuracy (57.4%) among models, while Claude-3 Haiku demonstrated poor agreement with guidelines (kappa = 0.006).
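A kappa close to zero can coexist with a seemingly moderate accuracy when a rater's answers are no better aligned with the reference than chance would predict. The sketch below shows this with two invented 2x2 confusion matrices for a contrast/no-contrast decision. The counts are hypothetical and are not a reconstruction of the study's data, and the function name is an assumption made for illustration.

```python
def agreement_from_confusion(matrix):
    """Return (observed agreement, Cohen's kappa) for a square confusion matrix.

    Rows are the reference labels, columns are the rater's predictions.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical rater that always answers "contrast":
# 60 referrals need contrast, 40 do not.
always_contrast = [[60, 0],
                   [40, 0]]

# Hypothetical rater whose errors are spread across both classes.
balanced_rater = [[50, 10],
                  [10, 30]]

for name, matrix in [("always contrast", always_contrast),
                     ("balanced rater", balanced_rater)]:
    observed, kappa = agreement_from_confusion(matrix)
    print(f"{name}: accuracy={observed:.3f}, kappa={kappa:.3f}")
```

In the first matrix the rater is right 60% of the time purely because 60% of cases need contrast, so kappa is zero; in the second, agreement exceeds what the marginals alone would produce.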

Conclusion: Independent experts remain the most reliable, but LLMs show potential for optimization, particularly in organ prediction. A hybrid human-AI approach could enhance CT referral appropriateness and utilization. Further research should focus on improving LLM performance and exploring their integration into clinical workflows.

Key points:

Question: Can GPT-4 and Claude-3 Haiku justify CT referrals as accurately as independent experts, using the ESR iGuide as the gold standard?

Findings: Independent experts outperformed large language models in test justification. GPT-4 and Claude-3 showed comparable organ prediction but struggled with contrast selection, limiting full automation.

Clinical relevance: While independent experts remain most reliable, integrating AI with expert oversight may improve CT referral appropriateness, optimizing resource allocation and enhancing clinical decision-making.

Source journal

European Radiology (Medicine: Nuclear Medicine)

CiteScore: 11.60

Self-citation rate: 8.50%

Articles published: 874

Review time: 2-4 weeks

Journal description: European Radiology (ER) continuously updates scientific knowledge in radiology by publishing strong original articles and state-of-the-art reviews written by leading radiologists. A well-balanced combination of review articles, original papers, short communications from European radiological congresses, and information on society matters makes ER an indispensable source for current information in this field. This is the Journal of the European Society of Radiology, and the official journal of a number of societies. From 2004 to 2008, supplements to European Radiology were published under its companion, European Radiology Supplements, ISSN 1613-3749.