{"title":"Context-Aware Multi-view Stereo Network for Efficient Edge-Preserving Depth Estimation","authors":"Wanjuan Su, Wenbing Tao","doi":"10.1007/s11263-024-02337-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02337-8","url":null,"abstract":"<p>Learning-based multi-view stereo methods have achieved great progress in recent years by employing the coarse-to-fine depth estimation framework. However, existing methods still encounter difficulties in recovering depth in featureless areas, object boundaries, and thin structures which mainly due to the poor distinguishability of matching clues in low-textured regions, the inherently smooth properties of 3D convolution neural networks used for cost volume regularization, and information loss of the coarsest scale features. To address these issues, we propose a Context-Aware multi-view stereo Network (CANet) that leverages contextual cues in images to achieve efficient edge-preserving depth estimation. The structural self-similarity information in the reference view is exploited by the introduced self-similarity attended cost aggregation module to perform long-range dependencies modeling in the cost volume, which can boost the matchability of featureless regions. The context information in the reference view is subsequently utilized to progressively refine multi-scale depth estimation through the proposed hierarchical edge-preserving residual learning module, resulting in delicate depth estimation at edges. To enrich features at the coarsest scale by making it focus more on delicate areas, a focal selection module is presented which can enhance the recovery of initial depth with finer details such as thin structure. By integrating the strategies above into the well-designed lightweight cascade framework, CANet achieves superior performance and efficiency trade-offs. Extensive experiments show that the proposed method achieves state-of-the-art performance with fast inference speed and low memory usage. Notably, CANet ranks first on challenging Tanks and Temples advanced dataset and ETH3D high-res benchmark among all published learning-based methods.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"39 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142935481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Delving Deep into Simplicity Bias for Long-Tailed Image Recognition","authors":"Xiu-Shen Wei, Xuhao Sun, Yang Shen, Peng Wang","doi":"10.1007/s11263-024-02342-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02342-x","url":null,"abstract":"<p>Simplicity Bias (SB) is a phenomenon that deep neural networks tend to rely favorably on simpler predictive patterns but ignore some complex features when applied to supervised discriminative tasks. In this work, we investigate SB in long-tailed image recognition and find the tail classes suffer more severely from SB, which harms the generalization performance of such underrepresented classes. We empirically report that self-supervised learning (SSL) can mitigate SB and perform in complementary to the supervised counterpart by enriching the features extracted from tail samples and consequently taking better advantage of such rare samples. However, standard SSL methods are designed without explicitly considering the inherent data distribution in terms of classes and may not be optimal for long-tailed distributed data. To address this limitation, we propose a novel SSL method tailored to imbalanced data. It leverages SSL by triple diverse levels, <i>i.e.</i>, holistic-, partial-, and augmented-level, to enhance the learning of predictive complex patterns, which provides the potential to overcome the severe SB on tail data. Both quantitative and qualitative experimental results on five long-tailed benchmark datasets show our method can effectively mitigate SB and significantly outperform the competing state-of-the-arts.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142929448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relation-Guided Versatile Regularization for Federated Semi-Supervised Learning","authors":"Qiushi Yang, Zhen Chen, Zhe Peng, Yixuan Yuan","doi":"10.1007/s11263-024-02330-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02330-1","url":null,"abstract":"<p>Federated semi-supervised learning (FSSL) target to address the increasing privacy concerns for the practical scenarios, where data holders are limited in labeling capability. Latest FSSL approaches leverage the prediction consistency between the local model and global model to exploit knowledge from partially labeled or completely unlabeled clients. However, they merely utilize data-level augmentation for prediction consistency and simply aggregate model parameters through the weighted average at the server, which leads to biased classifiers and suffers from skewed unlabeled clients. To remedy these issues, we present a novel FSSL framework, Relation-guided Versatile Regularization (FedRVR), consisting of versatile regularization at clients and relation-guided directional aggregation strategy at the server. In versatile regularization, we propose the model-guided regularization together with the data-guided one, and encourage the prediction of the local model invariant to two extreme global models with different abilities, which provides richer consistency supervision for local training. Moreover, we devise a relation-guided directional aggregation at the server, in which a parametric relation predictor is introduced to yield pairwise model relation and obtain a model ranking. In this manner, the server can provide a superior global model by aggregating relative dependable client models, and further produce an inferior global model via reverse aggregation to promote the versatile regularization at clients. Extensive experiments on three FSSL benchmarks verify the superiority of FedRVR over state-of-the-art counterparts across various federated learning settings.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PICK: Predict and Mask for Semi-supervised Medical Image Segmentation","authors":"Qingjie Zeng, Zilin Lu, Yutong Xie, Yong Xia","doi":"10.1007/s11263-024-02328-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02328-9","url":null,"abstract":"<p>Pseudo-labeling and consistency-based co-training are established paradigms in semi-supervised learning. Pseudo-labeling focuses on selecting reliable pseudo-labels, while co-training emphasizes sub-network diversity for complementary information extraction. However, both paradigms struggle with the inevitable erroneous predictions from unlabeled data, which poses a risk to task-specific decoders and ultimately impact model performance. To address this challenge, we propose a PredICt-and-masK (PICK) model for semi-supervised medical image segmentation. PICK operates by masking and predicting pseudo-label-guided attentive regions to exploit unlabeled data. It features a shared encoder and three task-specific decoders. Specifically, PICK employs a primary decoder supervised solely by labeled data to generate pseudo-labels, identifying potential targets in unlabeled data. The model then masks these regions and reconstructs them using a masked image modeling (MIM) decoder, optimizing through a reconstruction task. To reconcile segmentation and reconstruction, an auxiliary decoder is further developed to learn from the reconstructed images, whose predictions are constrained by the primary decoder. We evaluate PICK on five medical benchmarks, including single organ/tumor segmentation, multi-organ segmentation, and domain-generalized tasks. Our results indicate that PICK outperforms state-of-the-art methods. The code is available at https://github.com/maxwell0027/PICK.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"27 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142929487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"General Class-Balanced Multicentric Dynamic Prototype Pseudo-Labeling for Source-Free Domain Adaptation","authors":"Sanqing Qu, Guang Chen, Jing Zhang, Zhijun Li, Wei He, Dacheng Tao","doi":"10.1007/s11263-024-02335-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02335-w","url":null,"abstract":"<p>Source-free Domain Adaptation aims to adapt a pre-trained source model to an unlabeled target domain while circumventing access to well-labeled source data. To compensate for the absence of source data, most existing approaches employ prototype-based pseudo-labeling strategies to facilitate self-training model adaptation. Nevertheless, these methods commonly rely on instance-level predictions for direct monocentric prototype construction, leading to category bias and noisy labels. This is primarily due to the inherent visual domain gaps that often differ across categories. Besides, the monocentric prototype design is ineffective and may introduce negative transfer for those ambiguous data. To tackle these challenges, we propose a general class-<b>B</b>alanced <b>M</b>ulticentric <b>D</b>ynamic (BMD) prototype strategy. Specifically, we first introduce a global inter-class balanced sampling strategy for each target category to mitigate category bias. Subsequently, we design an intra-class multicentric clustering strategy to generate robust and representative prototypes. In contrast to existing approaches that only update pseudo-labels at fixed intervals, e.g., one epoch, we employ a dynamic pseudo-labeling strategy that incorporates network update information throughout the model adaptation. We refer to the vanilla implementation of these three sub-strategies as BMD-v1. Furthermore, we promote the BMD-v1 to BMD-v2 by incorporating a consistency-guided reweighting strategy to improve inter-class balanced sampling, and leveraging the silhouettes metric to realize adaptive intra-class multicentric clustering. Extensive experiments conducted on both 2D images and 3D point cloud recognition demonstrate that our proposed BMD strategy significantly improves existing representative methods. Remarkably, BMD-v2 improves NRC from 52.6 to 59.2% in accuracy on the PointDA-10 benchmark. The code will be available at https://github.com/ispc-lab/BMD.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"159 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning","authors":"Zengxi Zhang, Zhiying Jiang, Long Ma, Jinyuan Liu, Xin Fan, Risheng Liu","doi":"10.1007/s11263-024-02318-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02318-x","url":null,"abstract":"<p>Underwater images are often affected by light refraction and absorption, reducing visibility and interfering with subsequent applications. Existing underwater image enhancement methods primarily focus on improving visual quality while overlooking practical implications. To strike a balance between visual quality and application, we propose a heuristic invertible network for underwater perception enhancement, dubbed HUPE, which enhances visual quality and demonstrates flexibility in handling other downstream tasks. Specifically, we introduced a information-preserving reversible transformation with embedded Fourier transform to establish a bidirectional mapping between underwater images and their clear images. Additionally, a heuristic prior is incorporated into the enhancement process to better capture scene information. To further bridges the feature gap between vision-based enhancement images and application-oriented images, a semantic collaborative learning module is applied in the joint optimization process of the visual enhancement task and the downstream task, which guides the proposed enhancement model to extract more task-oriented semantic features while obtaining visually pleasing images. Extensive experiments, both quantitative and qualitative, demonstrate the superiority of our HUPE over state-of-the-art methods. The source code is available at https://github.com/ZengxiZhang/HUPE.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Sequential DeepFake Detection","authors":"Rui Shao, Tianxing Wu, Ziwei Liu","doi":"10.1007/s11263-024-02339-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02339-6","url":null,"abstract":"<p>Since photorealistic faces can be readily generated by facial manipulation technologies nowadays, potential malicious abuse of these technologies has drawn great concerns. Numerous deepfake detection methods are thus proposed. However, existing methods only focus on detecting <i>one-step</i> facial manipulation. As the emergence of easy-accessible facial editing applications, people can easily manipulate facial components using <i>multi-step</i> operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital for both detecting deepfake media and recovering original faces afterwards. Motivated by this observation, we emphasize the need and propose a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task only demanding a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face images are manipulated sequentially with corresponding annotations of sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence (e.g., image captioning) task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). To better reflect real-world deepfake data distributions, we further apply various perturbations on the original Seq-DeepFake dataset and construct the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P). To exploit deeper correlation between images and sequences when facing Seq-DeepFake-P, a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++) is devised, which builds stronger correspondence between image-sequence pairs for more robust Seq-DeepFake detection. Moreover, we build a comprehensive benchmark and set up rigorous evaluation protocols and metrics for this new research problem. Extensive quantitative and qualitative experiments demonstrate the effectiveness of SeqFakeFormer and SeqFakeFormer++. Several valuable observations are also revealed to facilitate future research in broader deepfake detection problems. The code has been released at https://github.com/rshaojimmy/SeqDeepFake/.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"388 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142924999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind Image Quality Assessment: Exploring Content Fidelity Perceptibility via Quality Adversarial Learning","authors":"Mingliang Zhou, Wenhao Shen, Xuekai Wei, Jun Luo, Fan Jia, Xu Zhuang, Weijia Jia","doi":"10.1007/s11263-024-02338-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02338-7","url":null,"abstract":"<p>In deep learning-based no-reference image quality assessment (NR-IQA) methods, the absence of reference images limits their ability to assess content fidelity, making it difficult to distinguish between original content and distortions that degrade quality. To address this issue, we propose a quality adversarial learning framework emphasizing both content fidelity and prediction accuracy. The main contributions of this study are as follows: First, we investigate the importance of content fidelity, especially in no-reference scenarios. Second, we propose a quality adversarial learning framework that dynamically adapts and refines the image quality assessment process on the basis of the quality optimization results. The framework generates adversarial samples for the quality prediction model, and simultaneously, the quality prediction model optimizes the quality prediction model by using these adversarial samples to maintain fidelity and improve accuracy. Finally, we demonstrate that by employing the quality prediction model as a loss function for image quality optimization, our framework effectively reduces the generation of artifacts, highlighting its superior ability to preserve content fidelity. The experimental results demonstrate the validity of our method compared with state-of-the-art NR-IQA methods. The code is publicly available at the following website: https://github.com/Land5cape/QAL-IQA.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"27 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142917147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RepSNet: A Nucleus Instance Segmentation Model Based on Boundary Regression and Structural Re-Parameterization","authors":"Shengchun Xiong, Xiangru Li, Yunpeng Zhong, Wanfen Peng","doi":"10.1007/s11263-024-02332-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02332-z","url":null,"abstract":"<p>Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets are the major challenges in the studies of this problem. To this end, a neural network model RepSNet was designed based on a nucleus boundary regression and a structural re-parameterization scheme for segmenting and classifying the nuclei in H&E-stained histopathological images. First, RepSNet estimates the boundary position information (BPI) of the parent nucleus for each pixel. The BPI estimation incorporates the local information of the pixel and the contextual information of the parent nucleus. Then, the nucleus boundary is estimated by aggregating the BPIs from a series of pixels using a proposed boundary voting mechanism (BVM), and the instance segmentation results are computed from the estimated nucleus boundary using a connected component analysis procedure. The BVM intrinsically achieves a kind of synergistic belief enhancement among the BPIs from various pixels. Therefore, different from the methods available in literature that obtain nucleus boundaries based on a direct pixel recognition scheme, RepSNet computes its boundary decisions based on some guidances from macroscopic information using an integration mechanism. In addition, RepSNet employs a re-parametrizable encoder-decoder structure. This model can not only aggregate features from some receptive fields with various scales which helps segmentation accuracy improvement, but also reduce the parameter amount and computational burdens in the model inference phase through the structural re-parameterization technique. In the experimental comparisons and evaluations on the Lizard dataset, RepSNet demonstrated superior segmentation accuracy and inference speed compared to several typical benchmark models. The experimental code, dataset splitting configuration and the pre-trained model were released at https://github.com/luckyrz0/RepSNet.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142917313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pseudo-Plane Regularized Signed Distance Field for Neural Indoor Scene Reconstruction","authors":"Jing Li, Jinpeng Yu, Ruoyu Wang, Shenghua Gao","doi":"10.1007/s11263-024-02319-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02319-w","url":null,"abstract":"<p>Given only a set of images, neural implicit surface representation has shown its capability in 3D surface reconstruction. However, as the nature of per-scene optimization is based on the volumetric rendering of color, previous neural implicit surface reconstruction methods usually fail in the low-textured regions, including floors, walls, etc., which commonly exist for indoor scenes. Being aware of the fact that these low-textured regions usually correspond to planes, without introducing additional ground-truth supervisory signals or making additional assumptions about the room layout, we propose to leverage a novel Pseudo-plane regularized Signed Distance Field (PPlaneSDF) for indoor scene reconstruction. Specifically, we consider adjacent pixels with similar colors to be on the same pseudo-planes. The plane parameters are then estimated on the fly during training by an efficient and effective two-step scheme. Then the signed distances of the points on the planes are regularized by the estimated plane parameters in the training phase. As the unsupervised plane segments are usually noisy and inaccurate, we propose to assign different weights to the sampled points on the plane in plane estimation as well as the regularization loss. The weights come by fusing the plane segments from different views. As the sampled rays in the planar regions are redundant, leading to inefficient training, we further propose a keypoint-guided rays sampling strategy that attends to the informative textured regions with large color variations, and the implicit network gets a better reconstruction, compared with the original uniform ray sampling strategy. Experiments show that our PPlaneSDF achieves competitive reconstruction performance in Manhattan scenes. Further, as we do not introduce any additional room layout assumption, our PPlaneSDF generalizes well to the reconstruction of non-Manhattan scenes.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}