ConVLM: Context-guided vision-language model for fine-grained histopathology image classification

IF 15.5 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Anabia Sohail , Iyyakutti Iyappan Ganapathi , Basit Alawode , Sajid Javed , Mohammed Bennamoun , Arif Mahmood
{"title":"ConVLM:上下文引导的细粒度组织病理学图像分类视觉语言模型","authors":"Anabia Sohail ,&nbsp;Iyyakutti Iyappan Ganapathi ,&nbsp;Basit Alawode ,&nbsp;Sajid Javed ,&nbsp;Mohammed Bennamoun ,&nbsp;Arif Mahmood","doi":"10.1016/j.inffus.2025.103737","DOIUrl":null,"url":null,"abstract":"<div><div>Vision-Language Models (VLMs) have recently demonstrated exceptional results across various Computational Pathology (CPath) tasks, such as Whole Slide Image (WSI) classification and survival prediction. These models utilize large-scale datasets to align images and text by incorporating language priors during pre-training. However, the separate training of text and vision encoders in current VLMs leads to only coarse-level alignment, failing to capture the fine-level dependencies between image-text pairs. This limitation restricts their generalization in many downstream CPath tasks. In this paper, we propose a novel approach that enhances the capture of finer-level context through language priors, which better represent the fine-grained tissue morphological structures in histology images. We propose a Context-guided Vision-Language Model (ConVLM) that generates contextually relevant visual embeddings from histology images. ConVLM achieves this by employing context-guided token learning and token enhancement modules to identify and eliminate contextually irrelevant visual tokens, refining the visual representation. These two modules are integrated into various layers of the ConVLM encoders to progressively learn context-guided visual embeddings, enhancing visual-language interactions. The model is trained end-to-end using a context-guided token learning-based loss function. We conducted extensive experiments on 20 histopathology datasets, evaluating both Region of Interest (ROI)-level and cancer subtype WSI-level classification tasks. The results indicate that ConVLM significantly outperforms existing State-of-the-Art (SOTA) vision-language and foundational models. Our source code and pre-trained model is publicly available on: <span><span>https://github.com/BasitAlawode/ConVLM</span><svg><path></path></svg></span></div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103737"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ConVLM: Context-guided vision-language model for fine-grained histopathology image classification\",\"authors\":\"Anabia Sohail ,&nbsp;Iyyakutti Iyappan Ganapathi ,&nbsp;Basit Alawode ,&nbsp;Sajid Javed ,&nbsp;Mohammed Bennamoun ,&nbsp;Arif Mahmood\",\"doi\":\"10.1016/j.inffus.2025.103737\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Vision-Language Models (VLMs) have recently demonstrated exceptional results across various Computational Pathology (CPath) tasks, such as Whole Slide Image (WSI) classification and survival prediction. These models utilize large-scale datasets to align images and text by incorporating language priors during pre-training. However, the separate training of text and vision encoders in current VLMs leads to only coarse-level alignment, failing to capture the fine-level dependencies between image-text pairs. This limitation restricts their generalization in many downstream CPath tasks. 
In this paper, we propose a novel approach that enhances the capture of finer-level context through language priors, which better represent the fine-grained tissue morphological structures in histology images. We propose a Context-guided Vision-Language Model (ConVLM) that generates contextually relevant visual embeddings from histology images. ConVLM achieves this by employing context-guided token learning and token enhancement modules to identify and eliminate contextually irrelevant visual tokens, refining the visual representation. These two modules are integrated into various layers of the ConVLM encoders to progressively learn context-guided visual embeddings, enhancing visual-language interactions. The model is trained end-to-end using a context-guided token learning-based loss function. We conducted extensive experiments on 20 histopathology datasets, evaluating both Region of Interest (ROI)-level and cancer subtype WSI-level classification tasks. The results indicate that ConVLM significantly outperforms existing State-of-the-Art (SOTA) vision-language and foundational models. Our source code and pre-trained model is publicly available on: <span><span>https://github.com/BasitAlawode/ConVLM</span><svg><path></path></svg></span></div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103737\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525007997\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525007997","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


Vision-Language Models (VLMs) have recently demonstrated exceptional results across various Computational Pathology (CPath) tasks, such as Whole Slide Image (WSI) classification and survival prediction. These models utilize large-scale datasets to align images and text by incorporating language priors during pre-training. However, the separate training of text and vision encoders in current VLMs leads to only coarse-level alignment, failing to capture the fine-level dependencies between image-text pairs. This limitation restricts their generalization in many downstream CPath tasks. In this paper, we propose a novel approach that enhances the capture of finer-level context through language priors, which better represent the fine-grained tissue morphological structures in histology images. We propose a Context-guided Vision-Language Model (ConVLM) that generates contextually relevant visual embeddings from histology images. ConVLM achieves this by employing context-guided token learning and token enhancement modules to identify and eliminate contextually irrelevant visual tokens, refining the visual representation. These two modules are integrated into various layers of the ConVLM encoders to progressively learn context-guided visual embeddings, enhancing visual-language interactions. The model is trained end-to-end using a context-guided token learning-based loss function. We conducted extensive experiments on 20 histopathology datasets, evaluating both Region of Interest (ROI)-level and cancer subtype WSI-level classification tasks. The results indicate that ConVLM significantly outperforms existing State-of-the-Art (SOTA) vision-language and foundational models. Our source code and pre-trained model are publicly available at: https://github.com/BasitAlawode/ConVLM
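The abstract describes context-guided token learning only at a high level. Below is a minimal sketch of one plausible reading of the pruning step: a pooled text embedding scores each visual patch token by cosine relevance, and the lowest-scoring tokens are dropped before the next encoder layer. All names here (ContextGuidedTokenSelector, keep_ratio) are illustrative assumptions, not the authors' released API; consult the linked repository for the actual implementation.

```python
# Hypothetical sketch of context-guided token selection as described in the
# abstract: score each visual token against a text-derived context embedding
# and keep only the most contextually relevant tokens. Module and parameter
# names are illustrative, not taken from the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextGuidedTokenSelector(nn.Module):
    """Keep the visual tokens most similar to the language context."""

    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # project text context into the visual space
        self.keep_ratio = keep_ratio     # fraction of visual tokens to retain

    def forward(self, visual_tokens: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D) patch tokens; context: (B, D) pooled text embedding
        ctx = F.normalize(self.proj(context), dim=-1)         # (B, D)
        vis = F.normalize(visual_tokens, dim=-1)              # (B, N, D)
        scores = torch.einsum("bnd,bd->bn", vis, ctx)         # cosine relevance per token
        k = max(1, int(self.keep_ratio * visual_tokens.size(1)))
        idx = scores.topk(k, dim=1).indices                   # (B, k) most relevant tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
        return visual_tokens.gather(1, idx)                   # (B, k, D) pruned tokens


if __name__ == "__main__":
    selector = ContextGuidedTokenSelector(dim=768, keep_ratio=0.7)
    tokens = torch.randn(2, 196, 768)   # e.g., ViT patch tokens from a histology tile
    text_ctx = torch.randn(2, 768)      # pooled text-prompt embedding
    kept = selector(tokens, text_ctx)
    print(kept.shape)                   # torch.Size([2, 137, 768])
```

In this reading, repeating the selection at several encoder depths progressively removes contextually irrelevant tokens, matching the abstract's description of modules integrated into multiple layers; the exact scoring function, keep ratio, and the context-guided token learning loss are details the paper itself specifies.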
Source journal: Information Fusion (Engineering & Technology / Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Annual articles: 161
Review time: 7.9 months

Journal introduction: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.