ConVLM: Context-guided vision-language model for fine-grained histopathology image classification

Anabia Sohail, Iyyakutti Iyappan Ganapathi, Basit Alawode, Sajid Javed, Mohammed Bennamoun, Arif Mahmood

Information Fusion, Volume 127, Article 103737. Published: 2025-09-18. DOI: 10.1016/j.inffus.2025.103737
Citations: 0
Abstract
Vision-Language Models (VLMs) have recently demonstrated exceptional results across various Computational Pathology (CPath) tasks, such as Whole Slide Image (WSI) classification and survival prediction. These models utilize large-scale datasets to align images and text by incorporating language priors during pre-training. However, the separate training of text and vision encoders in current VLMs leads to only coarse-level alignment, failing to capture the fine-level dependencies between image-text pairs. This limitation restricts their generalization in many downstream CPath tasks. In this paper, we propose a novel approach that enhances the capture of finer-level context through language priors, which better represent the fine-grained tissue morphological structures in histology images. We propose a Context-guided Vision-Language Model (ConVLM) that generates contextually relevant visual embeddings from histology images. ConVLM achieves this by employing context-guided token learning and token enhancement modules to identify and eliminate contextually irrelevant visual tokens, refining the visual representation. These two modules are integrated into various layers of the ConVLM encoders to progressively learn context-guided visual embeddings, enhancing visual-language interactions. The model is trained end-to-end using a context-guided token learning-based loss function. We conducted extensive experiments on 20 histopathology datasets, evaluating both Region of Interest (ROI)-level and cancer subtype WSI-level classification tasks. The results indicate that ConVLM significantly outperforms existing State-of-the-Art (SOTA) vision-language and foundational models. Our source code and pre-trained model are publicly available at: https://github.com/BasitAlawode/ConVLM
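The core idea described in the abstract, scoring visual tokens against a language-derived context and discarding the contextually irrelevant ones, can be illustrated with a minimal sketch. This is not the released ConVLM implementation (see the GitHub link above); the module name, the cosine-similarity scoring, and the keep_ratio parameter are assumptions made purely for illustration.

```python
# Illustrative sketch of context-guided visual token pruning (NOT the authors' code).
# ContextGuidedTokenSelector and keep_ratio are hypothetical names.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextGuidedTokenSelector(nn.Module):
    """Scores visual tokens against a text-derived context vector and
    keeps only the most contextually relevant ones."""

    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # project context into the visual token space
        self.keep_ratio = keep_ratio

    def forward(self, visual_tokens: torch.Tensor, context: torch.Tensor):
        # visual_tokens: (B, N, D), context: (B, D)
        ctx = self.proj(context).unsqueeze(1).expand_as(visual_tokens)  # (B, N, D)
        scores = F.cosine_similarity(visual_tokens, ctx, dim=-1)        # (B, N)
        k = max(1, int(self.keep_ratio * visual_tokens.size(1)))
        top_idx = scores.topk(k, dim=1).indices                         # (B, k)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
        kept = torch.gather(visual_tokens, 1, idx)                      # (B, k, D)
        return kept, scores


if __name__ == "__main__":
    selector = ContextGuidedTokenSelector(dim=512, keep_ratio=0.7)
    tokens = torch.randn(2, 196, 512)  # e.g. ViT patch tokens of a histology tile
    text_ctx = torch.randn(2, 512)     # pooled text/prompt embedding
    kept, scores = selector(tokens, text_ctx)
    print(kept.shape)                  # torch.Size([2, 137, 512])
```

In the paper's framing, such a selector would be inserted at several encoder layers so that token pruning is progressive, and the relevance scores themselves would feed a context-guided token learning loss; the sketch above shows only the single-layer selection step.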
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.