Progressive Visual Prompt Learning with Contrastive Feature Re-formation

IF 11.6 · Zone 2 (Computer Science) · Q1 Computer Science, Artificial Intelligence

International Journal of Computer Vision Pub Date : 2024-08-06 DOI:10.1007/s11263-024-02172-x

Chen Xu, Yuhan Zhu, Haocheng Shen, Boheng Chen, Yixuan Liao, Xiaoxin Chen, Limin Wang



Abstract

Prompt learning has recently emerged as a compelling alternative to the traditional fine-tuning paradigm for adapting pre-trained Vision-Language (V-L) models to downstream tasks. Drawing inspiration from the success of prompt learning in Natural Language Processing, pioneering research efforts have concentrated predominantly on text-based prompting strategies. By contrast, visual prompting within V-L models remains underexploited. Directly transposing existing visual prompt methods, which were tailored for Vision Transformers (ViT), into V-L models often leads to suboptimal performance or training instability. To mitigate these challenges, in this paper, we propose a novel structure called Progressive Visual Prompt (ProVP). This design aims to strengthen the interaction among prompts from adjacent layers, thereby propagating image embeddings to deeper layers more effectively, in a way that resembles instance-specific prompting. Additionally, to address the common issue of generalization deteriorating during the training of learnable prompts, we further introduce a contrastive feature re-formation technique for visual prompt learning. This method prevents the prompted visual features from deviating significantly from the fixed CLIP visual feature distribution, which helps preserve generalization capability. Combining ProVP and the contrastive feature re-formation technique, our proposed method, ProVP-Ref, significantly stabilizes the training process and enhances both the adaptation and generalization capabilities of visual prompt learning in V-L models. To demonstrate the efficacy of our approach, we evaluate ProVP-Ref across 11 image datasets, achieving state-of-the-art results on 7 of these datasets in both few-shot learning and base-to-new generalization settings. To the best of our knowledge, this is the first study to showcase the exceptional performance of visual prompts in V-L models compared to previous text prompting methods in this area.
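The abstract names two mechanisms without giving formulas: prompts that interact across adjacent layers, and a contrastive term that keeps prompted features close to the frozen CLIP features. The toy sketch below illustrates one plausible reading of each and should not be taken as the authors' implementation. Everything in it is an assumption made for illustration: the function names, the convex mixing weight `alpha`, the InfoNCE-style form of the loss, and the `temperature` value.

```python
import math


def mix_prompts(new_prompt, prev_output, alpha):
    """Blend a layer's own learnable prompt with the prompt output of the
    previous layer. The convex combination and `alpha` are assumptions,
    chosen only to illustrate interaction between adjacent layers."""
    return [(1.0 - alpha) * p + alpha * o
            for p, o in zip(new_prompt, prev_output)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def reformation_loss(prompted, frozen_refs, target, temperature=0.1):
    """InfoNCE-style loss (assumed form): pull the prompted feature toward
    the frozen reference feature at index `target` and away from the
    other frozen references."""
    logits = [cosine(prompted, ref) / temperature for ref in frozen_refs]
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target]


# Toy data: two frozen reference features, one per class.
frozen_refs = [[1.0, 0.0], [0.0, 1.0]]
close_loss = reformation_loss([0.9, 0.1], frozen_refs, target=0)
drift_loss = reformation_loss([0.1, 0.9], frozen_refs, target=0)
mixed = mix_prompts([1.0, 0.0], [0.0, 1.0], alpha=0.25)
```

In this toy setting the loss is small when the prompted feature stays near its frozen reference and grows as the feature drifts toward another reference, which is the behavior the abstract attributes to contrastive feature re-formation.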




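Source journal

International Journal of Computer Vision (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months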

About the journal: The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or tutorial presentations of relevant topics. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software, to aid the understanding and reproducibility of the published research.