Alignment between AI clinical decision tools and multidisciplinary tumor board decisions in prostate cancer.

IF 1.9 · JCR Q3 · Urology & Nephrology (Medicine, Region 4)
Akif Koc, Oguzhan Akpinar, Alper Keskin, Halil Ibrahim Tarhan, Muhammet Guzelsoy
DOI: 10.1007/s11255-026-05178-1
Journal: International Urology and Nephrology
Published: 2026-05-06 (Journal Article)
Citations: 0

Abstract

Purpose: This study aimed to evaluate the concordance between treatment recommendations generated by large language models (LLMs) and decisions made by a multidisciplinary uro-oncology tumor board.

Methods: Forty-eight consecutive prostate cancer cases previously discussed at a multidisciplinary tumor board were retrospectively analyzed. For each case, treatment recommendations were generated using five LLM platforms (ChatGPT-4o, ChatGPT, Perplexity, Copilot, and DeepSeek) based on standardized clinical summaries. Four independent urology specialists evaluated the concordance between LLM recommendations and tumor board decisions using a 5-point Likert scale. Differences among models were assessed using the Friedman test followed by Bonferroni-corrected Wilcoxon signed-rank tests. Inter-rater agreement was calculated using the intraclass correlation coefficient.
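The comparison described above (Friedman omnibus test, then Bonferroni-corrected pairwise Wilcoxon signed-rank tests) can be sketched as follows. This is a minimal illustration on synthetic Likert scores, not the study's data; the per-platform baseline values are invented to loosely echo the reported medians.

```python
# Sketch of the Methods' statistical comparison on SYNTHETIC Likert scores
# (1-5). Platform names match the study; all numbers here are invented.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
platforms = ["ChatGPT-4o", "ChatGPT", "Perplexity", "Copilot", "DeepSeek"]
n_cases = 48

# Synthetic per-case scores, rounded to half points and clipped to [1, 5].
base = {"ChatGPT-4o": 4.6, "ChatGPT": 4.0, "Perplexity": 4.7,
        "Copilot": 3.0, "DeepSeek": 4.0}
scores = {p: np.clip(np.round(rng.normal(base[p], 0.5, n_cases) * 2) / 2, 1, 5)
          for p in platforms}

# Omnibus test: do the five related samples differ?
chi2, p_val = stats.friedmanchisquare(*(scores[p] for p in platforms))
print(f"Friedman chi2 = {chi2:.2f}, p = {p_val:.3g}")

# Post hoc: pairwise Wilcoxon signed-rank tests, Bonferroni-corrected
# over the 10 pairwise comparisons.
pairs = list(combinations(platforms, 2))
for a, b in pairs:
    if np.all(scores[a] == scores[b]):
        continue  # Wilcoxon is undefined when every paired difference is zero
    w, p = stats.wilcoxon(scores[a], scores[b])
    p_adj = min(p * len(pairs), 1.0)  # Bonferroni correction
    print(f"{a} vs {b}: W = {w:.1f}, adjusted p = {p_adj:.3g}")
```

With one platform shifted well below the others, the Friedman test rejects the null, and only the pairwise contrasts involving that platform survive the Bonferroni correction, mirroring the analysis structure reported in the Results.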

Results: Significant differences in concordance were observed among the evaluated AI platforms (χ2 = 32.16, p < 0.001). Perplexity and ChatGPT-4o demonstrated the highest alignment with tumor board decisions, each achieving a median Likert score of 4.75, whereas Copilot showed the lowest concordance (median 3.00). DeepSeek and ChatGPT demonstrated intermediate performance. Post hoc analyses revealed that Perplexity significantly outperformed several lower-performing platforms; however, no statistically significant difference was observed between Perplexity and ChatGPT-4o (p = 0.149). Expert evaluations showed strong inter-rater agreement (ICC = 0.82).
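The inter-rater agreement figure (ICC = 0.82) corresponds to an intraclass correlation for multiple raters scoring the same cases. A minimal sketch of ICC(2,1) (two-way random effects, absolute agreement, single rater, per Shrout & Fleiss) on synthetic four-rater data is shown below; the noise level and scores are invented, not the study's ratings.

```python
# ICC(2,1) from the two-way ANOVA decomposition (Shrout & Fleiss),
# demonstrated on SYNTHETIC ratings from 4 raters over 48 cases.
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ratings: (n_subjects, k_raters) matrix of scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between raters
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

rng = np.random.default_rng(1)
true_score = rng.integers(1, 6, size=48).astype(float)  # per-case "truth"
noise = rng.normal(0.0, 0.4, size=(48, 4))              # rater disagreement
ratings = np.clip(true_score[:, None] + noise, 1, 5)
print(f"ICC(2,1) = {icc2_1(ratings):.2f}")
```

Small rater noise relative to the between-case spread yields an ICC near 1; identical raters give exactly 1, which is a convenient sanity check for the formula.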

Conclusion: Large language models can demonstrate substantial concordance with multidisciplinary tumor board decisions in prostate cancer management. However, variability among models and the risk of hallucinated information indicate that LLMs should function as clinical decision-support tools under expert supervision rather than as autonomous decision-makers.

Source journal
International Urology and Nephrology (Medicine – Urology & Nephrology)
CiteScore: 3.40
Self-citation rate: 5.00%
Articles per year: 329
Review time: 1.7 months
Journal description: International Urology and Nephrology publishes original papers on a broad range of topics in urology, nephrology and andrology. The journal integrates papers originating from clinical practice.