Alignment between AI clinical decision tools and multidisciplinary tumor board decisions in prostate cancer.

IF 1.9 · JCR Q3 · Urology & Nephrology (Medicine, Region 4)
Akif Koc, Oguzhan Akpinar, Alper Keskin, Halil Ibrahim Tarhan, Muhammet Guzelsoy
DOI: 10.1007/s11255-026-05178-1
Journal: International Urology and Nephrology
Published: 2026-05-06 (Journal Article)
Citations: 0

Abstract

Purpose: This study aimed to evaluate the concordance between treatment recommendations generated by large language models (LLMs) and decisions made by a multidisciplinary uro-oncology tumor board.

Methods: Forty-eight consecutive prostate cancer cases previously discussed at a multidisciplinary tumor board were retrospectively analyzed. For each case, treatment recommendations were generated using five LLM platforms (ChatGPT-4o, ChatGPT, Perplexity, Copilot, and DeepSeek) based on standardized clinical summaries. Four independent urology specialists evaluated the concordance between LLM recommendations and tumor board decisions using a 5-point Likert scale. Differences among models were assessed using the Friedman test followed by Bonferroni-corrected Wilcoxon signed-rank tests. Inter-rater agreement was calculated using the intraclass correlation coefficient.
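The comparison described above (Friedman omnibus test, then Bonferroni-corrected pairwise Wilcoxon signed-rank tests) can be sketched as follows. This is a minimal illustration on synthetic Likert scores, not the study's data; the per-platform baseline values are invented to loosely echo the reported medians.

```python
# Sketch of the Methods' statistical comparison on SYNTHETIC Likert scores
# (1-5). Platform names match the study; all numbers here are invented.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
platforms = ["ChatGPT-4o", "ChatGPT", "Perplexity", "Copilot", "DeepSeek"]
n_cases = 48

# Synthetic per-case scores, rounded to half points and clipped to [1, 5].
base = {"ChatGPT-4o": 4.6, "ChatGPT": 4.0, "Perplexity": 4.7,
        "Copilot": 3.0, "DeepSeek": 4.0}
scores = {p: np.clip(np.round(rng.normal(base[p], 0.5, n_cases) * 2) / 2, 1, 5)
          for p in platforms}

# Omnibus test: do the five related samples differ?
chi2, p_val = stats.friedmanchisquare(*(scores[p] for p in platforms))
print(f"Friedman chi2 = {chi2:.2f}, p = {p_val:.3g}")

# Post hoc: pairwise Wilcoxon signed-rank tests, Bonferroni-corrected
# over the 10 pairwise comparisons.
pairs = list(combinations(platforms, 2))
for a, b in pairs:
    if np.all(scores[a] == scores[b]):
        continue  # Wilcoxon is undefined when every paired difference is zero
    w, p = stats.wilcoxon(scores[a], scores[b])
    p_adj = min(p * len(pairs), 1.0)  # Bonferroni correction
    print(f"{a} vs {b}: W = {w:.1f}, adjusted p = {p_adj:.3g}")
```

With one platform shifted well below the others, the Friedman test rejects the null, and only the pairwise contrasts involving that platform survive the Bonferroni correction, mirroring the analysis structure reported in the Results.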

Results: Significant differences in concordance were observed among the evaluated AI platforms (χ2 = 32.16, p < 0.001). Perplexity and ChatGPT-4o demonstrated the highest alignment with tumor board decisions, each achieving a median Likert score of 4.75, whereas Copilot showed the lowest concordance (median 3.00). DeepSeek and ChatGPT demonstrated intermediate performance. Post hoc analyses revealed that Perplexity significantly outperformed several lower-performing platforms; however, no statistically significant difference was observed between Perplexity and ChatGPT-4o (p = 0.149). Expert evaluations showed strong inter-rater agreement (ICC = 0.82).
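The inter-rater agreement figure (ICC = 0.82) corresponds to an intraclass correlation for multiple raters scoring the same cases. A minimal sketch of ICC(2,1) (two-way random effects, absolute agreement, single rater, per Shrout & Fleiss) on synthetic four-rater data is shown below; the noise level and scores are invented, not the study's ratings.

```python
# ICC(2,1) from the two-way ANOVA decomposition (Shrout & Fleiss),
# demonstrated on SYNTHETIC ratings from 4 raters over 48 cases.
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ratings: (n_subjects, k_raters) matrix of scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between raters
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

rng = np.random.default_rng(1)
true_score = rng.integers(1, 6, size=48).astype(float)  # per-case "truth"
noise = rng.normal(0.0, 0.4, size=(48, 4))              # rater disagreement
ratings = np.clip(true_score[:, None] + noise, 1, 5)
print(f"ICC(2,1) = {icc2_1(ratings):.2f}")
```

Small rater noise relative to the between-case spread yields an ICC near 1; identical raters give exactly 1, which is a convenient sanity check for the formula.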

Conclusion: Large language models can demonstrate substantial concordance with multidisciplinary tumor board decisions in prostate cancer management. However, variability among models and the risk of hallucinated information indicate that LLMs should function as clinical decision-support tools under expert supervision rather than as autonomous decision-makers.

Source journal
International Urology and Nephrology (Medicine – Urology & Nephrology)
CiteScore: 3.40
Self-citation rate: 5.00%
Articles per year: 329
Review time: 1.7 months
Journal description: International Urology and Nephrology publishes original papers on a broad range of topics in urology, nephrology and andrology. The journal integrates papers originating from clinical practice.