The imitation game: large language models versus multidisciplinary tumor boards: benchmarking AI against 21 sarcoma centers from the ring trial.

IF 2.8 | Zone 3 (Medicine) | Q3 ONCOLOGY

Journal of Cancer Research and Clinical Oncology Pub Date : 2025-09-10 DOI:10.1007/s00432-025-06304-9

Cheng-Peng Li, Aimé Terence Kalisa, Siyer Roohani, Kamal Hummedah, Franka Menge, Christoph Reißfelder, Markus Albertsmeier, Bernd Kasper, Jens Jakob, Cui Yang

Journal of Cancer Research and Clinical Oncology, volume 151, issue 9, article 248. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12420562/pdf/

Citations: 0

Abstract

Purpose: This study compares the treatment recommendations generated by four leading large language models (LLMs) with those of the multidisciplinary tumor boards (MTBs) of 21 sarcoma centers from the sarcoma ring trial in managing complex soft tissue sarcoma (STS) cases.

Methods: We simulated STS-MTBs using four LLMs (Llama 3.2-vision:90b, Claude 3.5 Sonnet, DeepSeek-R1, and OpenAI o1) across five anonymized STS cases from the sarcoma ring trial. Each model was queried 21 times per case using a standardized prompt, and the responses were compared with those of the human MTBs in terms of intra-model consistency, treatment recommendation alignment, alternative recommendations, and source citation.
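The repeated-query design above lends itself to a simple tally. The following is a minimal sketch, not the authors' analysis code: it assumes each response has already been reduced to a single treatment label, and it treats "consistency" as the share of runs that give a model's most frequent answer and "alignment" as a match between that answer and the most common MTB recommendation. The function names and all labels and counts are hypothetical.

```python
from collections import Counter


def modal_share(recommendations):
    """Return the most common label and the fraction of entries that carry it."""
    counts = Counter(recommendations)
    label, n = counts.most_common(1)[0]
    return label, n / len(recommendations)


def concordant(llm_runs, mtb_votes):
    """True if the model's most frequent answer equals the MTBs' most frequent answer."""
    return modal_share(llm_runs)[0] == modal_share(mtb_votes)[0]


# Hypothetical data for one case: 21 repeated runs of one model and 21 MTB votes.
llm_runs = ["surgery"] * 15 + ["neoadjuvant radiotherapy"] * 6
mtb_votes = (
    ["neoadjuvant radiotherapy"] * 9 + ["surgery"] * 8 + ["chemotherapy"] * 4
)

label, share = modal_share(llm_runs)
print(label, round(share, 2))           # surgery 0.71
print(concordant(llm_runs, mtb_votes))  # False
```

In this invented example the model is fairly consistent with itself (15 of 21 runs agree) yet still disagrees with the most common board recommendation, which is why the two measures are worth reporting separately.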

Results: LLMs demonstrated high inter-model and intra-model consistency in only 20% of cases, and their recommendations aligned with human consensus in only 20-60% of cases. The model with the highest concordance with the most common MTB recommendation, Claude 3.5 Sonnet, aligned with experts in only 60% of cases. Notably, the recommendations across MTBs were highly heterogeneous, which puts the variable LLM performance in context. Discrepancies were particularly notable: common human recommendations were often absent from LLM outputs. Additionally, the rationale for the LLMs' recommendations was clearly derived from the German S3 sarcoma guidelines in only 24.8% to 55.2% of the responses. Potentially harmful suggestions were also occasionally observed in the LLMs' alternative recommendations.
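The percentages above rest on small denominators, which a short calculation makes concrete. This sketch assumes that case-level figures are out of the five cases and that response-level figures are out of the 5 x 21 = 105 responses per model; the counts 26 and 58 are inferred from the rounded percentages and are not stated in the abstract.

```python
CASES = 5
RUNS_PER_CASE = 21
RESPONSES_PER_MODEL = CASES * RUNS_PER_CASE  # 105, assuming no responses were excluded


def pct(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Case-level results: one case is 20%, three cases are 60%.
print(pct(1, CASES), pct(3, CASES))  # 20.0 60.0

# Response-level results: the inferred counts reproduce the reported range.
print(pct(26, RESPONSES_PER_MODEL), pct(58, RESPONSES_PER_MODEL))  # 24.8 55.2
```

Under these assumptions, "20% of cases" is a single case and "60% of cases" is three cases, so each case shifts a case-level result by 20 percentage points.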

Conclusions: Despite the considerable heterogeneity observed in MTB recommendations, the significant discrepancies and potentially harmful recommendations highlight the limitations of current AI tools, underscoring that referral to high-volume sarcoma centers remains essential for optimal patient care. At the same time, LLMs could serve as an excellent tool to prepare for multidisciplinary team (MDT) discussions.




Source journal

Journal of Cancer Research and Clinical Oncology (Medicine - Oncology)

CiteScore: 4.00
Self-citation rate: 2.80%
Articles published: 577
Review time: 2 months

About the journal: The "Journal of Cancer Research and Clinical Oncology" publishes significant and up-to-date articles within the fields of experimental and clinical oncology. The journal, which is chiefly devoted to original papers, also includes reviews as well as editorials and guest editorials on current, controversial topics. The Letters to the Editors section provides a forum for a rapid exchange of comments and information concerning previously published papers and topics of current interest. Meeting reports provide current information on the latest results presented at important congresses. The following fields are covered: carcinogenesis (etiology, mechanisms); molecular biology; recent developments in tumor therapy; general diagnosis; laboratory diagnosis; diagnostic and experimental pathology; oncologic surgery; and epidemiology.