{"title":"PathVLM-Eval: Evaluation of open vision language models in histopathology","authors":"Nauman Ullah Gilal , Rachida Zegour , Khaled Al-Thelaya , Erdener Özer , Marco Agus , Jens Schneider , Sabri Boughorbel","doi":"10.1016/j.jpi.2025.100455","DOIUrl":null,"url":null,"abstract":"<div><div>The emerging trend of vision language models (VLMs) has introduced a new paradigm in artificial intelligence (AI). However, their evaluation has predominantly focused on general-purpose datasets, providing a limited understanding of their effectiveness in specialized domains. Medical imaging, particularly digital pathology, could significantly benefit from VLMs for histological interpretation and diagnosis, enabling pathologists to use a complementary tool for faster morecomprehensive reporting and efficient healthcare service. In this work, we are interested in benchmarking VLMs on histopathology image understanding. We present an extensive evaluation of recent VLMs on the PathMMU dataset, a domain-specific benchmark that includes subsets such as PubMed, SocialPath, and EduContent. These datasets feature diverse formats, notably multiple-choice questions (MCQs), designed to aid pathologists in diagnostic reasoning and support professional development initiatives in histopathology. Utilizing VLMEvalKit, a widely used open-source evaluation framework—we bring publicly available pathology datasets under a single evaluation umbrella, ensuring unbiased and contamination-free assessments of model performance. Our study conducts extensive zero-shot evaluations of more than 60 state-of-the-art VLMs, including LLaVA, Qwen-VL, Qwen2-VL, InternVL, Phi3, Llama3, MOLMO, and XComposer series, significantly expanding the range of evaluated models compared to prior literature. Among the tested models, Qwen2-VL-72B-Instruct achieved superior performance with an average score of 63.97% outperforming other models across all PathMMU subsets. We conclude that this extensive evaluation will serve as a valuable resource, fostering the development of next-generation VLMs for analyzing digital pathology images. Additionally, we have released the complete evaluation results on our leaderboard PathVLM-Eval: <span><span>https://huggingface.co/spaces/gilalnauman/PathVLMs</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":37769,"journal":{"name":"Journal of Pathology Informatics","volume":"18 ","pages":"Article 100455"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Pathology Informatics","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2153353925000409","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Medicine","Score":null,"Total":0}
Citations: 0
Abstract
The emerging trend of vision language models (VLMs) has introduced a new paradigm in artificial intelligence (AI). However, their evaluation has predominantly focused on general-purpose datasets, providing a limited understanding of their effectiveness in specialized domains. Medical imaging, particularly digital pathology, could significantly benefit from VLMs for histological interpretation and diagnosis, enabling pathologists to use a complementary tool for faster, more comprehensive reporting and more efficient healthcare services. In this work, we are interested in benchmarking VLMs on histopathology image understanding. We present an extensive evaluation of recent VLMs on the PathMMU dataset, a domain-specific benchmark that includes subsets such as PubMed, SocialPath, and EduContent. These datasets feature diverse formats, notably multiple-choice questions (MCQs), designed to aid pathologists in diagnostic reasoning and support professional development initiatives in histopathology. Utilizing VLMEvalKit, a widely used open-source evaluation framework, we bring publicly available pathology datasets under a single evaluation umbrella, ensuring unbiased and contamination-free assessments of model performance. Our study conducts extensive zero-shot evaluations of more than 60 state-of-the-art VLMs, including the LLaVA, Qwen-VL, Qwen2-VL, InternVL, Phi3, Llama3, MOLMO, and XComposer series, significantly expanding the range of evaluated models compared to prior literature. Among the tested models, Qwen2-VL-72B-Instruct achieved superior performance with an average score of 63.97%, outperforming other models across all PathMMU subsets. We conclude that this extensive evaluation will serve as a valuable resource, fostering the development of next-generation VLMs for analyzing digital pathology images. Additionally, we have released the complete evaluation results on our leaderboard PathVLM-Eval: https://huggingface.co/spaces/gilalnauman/PathVLMs.
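For illustration, a zero-shot evaluation of this kind is typically launched through VLMEvalKit's run.py entry point. The command below is a minimal sketch rather than the authors' exact pipeline: the dataset key PathMMU_PubMed and the model key Qwen2-VL-72B-Instruct are assumptions and must match the names registered in the installed VLMEvalKit version.

# Minimal sketch; dataset and model keys are assumed, check the VLMEvalKit registry for the exact identifiers.
python run.py --data PathMMU_PubMed --model Qwen2-VL-72B-Instruct --work-dir ./pathvlm_results --verbose

The same invocation can be repeated across the other PathMMU subsets and model checkpoints to reproduce a leaderboard-style comparison under a single evaluation framework.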
About the journal:
The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, symposia, meeting abstracts, book reviews, and correspondence to the editors. All submissions are subject to rigorous peer review by the well-regarded editorial board and by expert referees in appropriate specialties.