Evaluating Vision and Pathology Foundation Models for Computational Pathology: A Comprehensive Benchmark Study.

Research Square. Pub Date: 2025-07-04. DOI: 10.21203/rs.3.rs-6823810/v1

Olivier Gevaert, Rohan Bareja, Francisco Carrillo-Perez, Yuanning Zheng, Marija Pizurica, Tarak Nandi, Jeanne Shen, Ravi Madduri

Abstract

To advance precision medicine in pathology, robust AI-driven foundation models are increasingly needed to uncover complex patterns in large-scale pathology datasets, enabling more accurate disease detection, classification, and prognostic insights. However, despite substantial progress in deep learning and computer vision, the comparative performance and generalizability of these pathology foundation models across diverse histopathological datasets and tasks remain largely unexamined. In this study, we benchmark 31 AI foundation models for computational pathology, including general vision models (VM), general vision-language models (VLM), pathology-specific vision models (Path-VM), and pathology-specific vision-language models (Path-VLM), evaluated over 41 tasks sourced from TCGA, CPTAC, external benchmarking datasets, and out-of-domain datasets. Our study demonstrates that Virchow2, a pathology foundation model, delivered the highest performance across TCGA, CPTAC, and external tasks, highlighting its effectiveness in diverse histopathological evaluations. We also show that Path-VM outperformed both Path-VLM and VM, securing top rankings across tasks, although its advantage over VM was not statistically significant. Our findings reveal that model size and data size did not consistently correlate with improved performance in pathology foundation models, challenging assumptions about scaling in histopathological applications. Lastly, our study demonstrates that a fusion model integrating top-performing foundation models achieved superior generalization across external tasks and diverse tissues in histopathological analysis. These findings emphasize the need for further research to understand the underlying factors influencing model performance and to develop strategies that enhance the generalizability and robustness of pathology-specific vision foundation models across different tissue types and datasets.
PathBench: https://pathbench.stanford.edu/
