{"title":"Generalization-preserving adaptation of vision-language models for open-vocabulary segmentation","authors":"Zhen Chen, Hao Tang, Shiliang Zhang","doi":"10.1016/j.cviu.2025.104518","DOIUrl":null,"url":null,"abstract":"<div><div>Recent progress in large-scale Vision-Language Models (VLMs) has significantly advanced open-vocabulary segmentation. Previous works typically either generate class-agnostic masks and classify them with frozen VLMs, or align the mask generator features with VLM text features. These approaches face challenges of weak spatial discrimination ability of frozen VLMs and poor generalization due to unreliable vision-language alignment. This paper introduces a novel Generalization-Preserving Adaptation (GPA) of VLMs for open-vocabulary segmentation. GPA enhances the spatial discrimination capability of pre-trained VLMs through an efficient fine-tuning scheme, which incorporates a spatial adaptation module comprising spatial dependency modeling and low-rank feature modulation for preserving the feature space. Additionally, GPA proposes a context-aware feature aggregation module to extract mask features better aligned with the VLM features for mask classification. It performs decoupled context modeling that generates object-agnostic contextualized feature map and object-specific classification maps for accentuating discriminative and contextual clues. By maintaining the original VLM feature distribution for vision-language alignment, GPA effectively preserves the generalization capabilities of VLMs while enhancing segmentation performance. Extensive experiments on multiple open-vocabulary panoptic and semantic segmentation benchmarks demonstrate both superior effectiveness and generalization capabilities compared to previous works.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104518"},"PeriodicalIF":3.5000,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225002413","RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Recent progress in large-scale Vision-Language Models (VLMs) has significantly advanced open-vocabulary segmentation. Previous works typically either generate class-agnostic masks and classify them with frozen VLMs, or align the mask generator features with VLM text features. These approaches face two challenges: the weak spatial discrimination ability of frozen VLMs, and poor generalization caused by unreliable vision-language alignment. This paper introduces a novel Generalization-Preserving Adaptation (GPA) of VLMs for open-vocabulary segmentation. GPA enhances the spatial discrimination capability of pre-trained VLMs through an efficient fine-tuning scheme, which incorporates a spatial adaptation module comprising spatial dependency modeling and low-rank feature modulation for preserving the feature space. Additionally, GPA proposes a context-aware feature aggregation module to extract mask features better aligned with the VLM features for mask classification. It performs decoupled context modeling, generating an object-agnostic contextualized feature map and object-specific classification maps to accentuate discriminative and contextual cues. By maintaining the original VLM feature distribution for vision-language alignment, GPA effectively preserves the generalization capabilities of VLMs while enhancing segmentation performance. Extensive experiments on multiple open-vocabulary panoptic and semantic segmentation benchmarks demonstrate both superior effectiveness and generalization capabilities compared to previous works.
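The abstract does not specify the exact form of the low-rank feature modulation, but such adapters are commonly implemented in a LoRA-like style: a zero-initialized low-rank residual added to frozen backbone features, so training starts from the original VLM feature space and drifts from it only gradually. The sketch below is a hypothetical illustration under that assumption, not the paper's actual module:

```python
import numpy as np

# Hypothetical sketch (assumption, not the paper's exact design): LoRA-style
# low-rank feature modulation on frozen VLM patch features. The up-projection
# B is zero-initialized, so the module is an identity map at the start of
# fine-tuning, which is one way to "preserve the feature space".
rng = np.random.default_rng(0)
dim, rank = 768, 8                           # ViT-B/16-like feature dim, small rank
A = rng.standard_normal((dim, rank)) * 0.01  # trainable down-projection
B = np.zeros((rank, dim))                    # trainable up-projection, zero-init

def modulate(x):
    # x: (num_tokens, dim) patch features from a frozen VLM backbone.
    # Adds a rank-`rank` residual; the frozen features themselves are untouched.
    return x + x @ A @ B

x = rng.standard_normal((196, dim))          # 14x14 patch tokens
y = modulate(x)
assert y.shape == x.shape
assert np.allclose(y, x)                     # identity before any training
```

Because the residual path has rank at most 8 here, the adapter adds only `2 * dim * rank` parameters per layer, which is what makes this kind of fine-tuning scheme efficient.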
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems