Benchmarking foundation models as feature extractors for weakly supervised computational pathology.

Impact factor: 26.8 · Zone 1 (Medicine) · JCR Q1 (ENGINEERING, BIOMEDICAL)

Nature Biomedical Engineering Pub Date : 2025-10-01 DOI:10.1038/s41551-025-01516-3

Peter Neidlinger, Omar S M El Nahhas, Hannah Sophie Muti, Tim Lenz, Michael Hoffmeister, Hermann Brenner, Marko van Treeck, Rupert Langer, Bastian Dislich, Hans Michael Behrens, Christoph Röcken, Sebastian Foersch, Daniel Truhn, Antonio Marra, Oliver Lester Saldanha, Jakob Nikolas Kather


Citations: 0

Abstract

Numerous pathology foundation models have been developed to extract clinically relevant information. There is currently limited literature independently evaluating these foundation models on external cohorts and clinically relevant tasks to uncover adjustments for future improvements. Here we benchmark 19 histopathology foundation models on 13 patient cohorts with 6,818 patients and 9,528 slides from lung, colorectal, gastric and breast cancers. The models were evaluated on weakly supervised tasks related to biomarkers, morphological properties and prognostic outcomes. We show that a vision-language foundation model, CONCH, yielded the highest overall performance when compared with vision-only foundation models, with Virchow2 a close second, although its superior performance was less pronounced in low-data scenarios and low-prevalence tasks. The experiments reveal that foundation models trained on distinct cohorts learn complementary features to predict the same label, and can be fused to outperform the current state of the art. An ensemble combining CONCH and Virchow2 predictions outperformed individual models in 55% of tasks, leveraging their complementary strengths in classification scenarios. Moreover, our findings suggest that data diversity outweighs data volume for foundation models.
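The abstract does not state how the CONCH and Virchow2 predictions were combined. As a minimal sketch, assuming the simplest option (averaging each model's per-slide positive-class probability) and assuming AUROC as the comparison metric, the idea of fusing two models' predictions could look like the following. The function names `average_ensemble` and `auroc` and all scores are hypothetical illustrations, not the authors' code or data.

```python
from __future__ import annotations

from typing import Sequence


def average_ensemble(probs_a: Sequence[float], probs_b: Sequence[float]) -> list[float]:
    """Average per-slide positive-class probabilities from two models.

    Assumption: simple unweighted averaging; the source does not specify
    the fusion rule that was actually used.
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same slides")
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical labels and scores, for illustration only.
    labels = [1, 1, 0, 0]
    model_a = [0.9, 0.4, 0.6, 0.1]  # stands in for one extractor's classifier
    model_b = [0.5, 0.8, 0.2, 0.6]  # stands in for the other
    fused = average_ensemble(model_a, model_b)
    for name, scores in [("A", model_a), ("B", model_b), ("A+B", fused)]:
        print(name, auroc(labels, scores))
```

Averaging probabilities is only one possible fusion rule; concatenating the two models' feature vectors before training the downstream classifier would be another, and the abstract does not say which was used.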


Source journal

Nature Biomedical Engineering (Medicine: Medicine (miscellaneous))

CiteScore: 45.30

Self-citation rate: 1.10%

Articles published: 138

Journal description: Nature Biomedical Engineering is an online-only monthly journal launched in January 2017. It publishes original research, reviews, and commentary focusing on applied biomedicine and health technology. The journal targets a diverse audience, including life scientists who develop experimental or computational systems and methods to enhance our understanding of human physiology, and biomedical researchers and engineers who design or optimize therapies, assays, devices, or procedures for diagnosing or treating diseases. Clinicians who use research outputs to evaluate patient health or administer therapy in various clinical settings and healthcare contexts are also part of the target audience.