Fei Zhang; Tianfei Zhou; Jiangchao Yao; Ya Zhang; Ivor W. Tsang; Yanfeng Wang
{"title":"在对齐之前解耦:视觉解耦增强了提示调谐","authors":"Fei Zhang;Tianfei Zhou;Jiangchao Yao;Ya Zhang;Ivor W. Tsang;Yanfeng Wang","doi":"10.1109/TPAMI.2025.3594894","DOIUrl":null,"url":null,"abstract":"<italic>P</i>rompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of <italic>vision-language models</i>. This paper delves into a previously overlooked <italic>information asymmetry</i> issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in the <italic>biased attention</i>, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive <italic>decouple-before-align</i> concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation via exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the <italic>region-of-interest</i> object. We demonstrate the power of architecture-free DAPT through <italic>few-shot learning</i>, <italic>base-to-novel generalization</i>, and <italic>data-efficient learning</i>, all of which yield superior performance across prevailing benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 11","pages":"10619-10632"},"PeriodicalIF":18.6000,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Decouple Before Align: Visual Disentanglement Enhances Prompt Tuning\",\"authors\":\"Fei Zhang;Tianfei Zhou;Jiangchao Yao;Ya Zhang;Ivor W. Tsang;Yanfeng Wang\",\"doi\":\"10.1109/TPAMI.2025.3594894\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<italic>P</i>rompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of <italic>vision-language models</i>. This paper delves into a previously overlooked <italic>information asymmetry</i> issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in the <italic>biased attention</i>, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive <italic>decouple-before-align</i> concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation via exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. 
To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the <italic>region-of-interest</i> object. We demonstrate the power of architecture-free DAPT through <italic>few-shot learning</i>, <italic>base-to-novel generalization</i>, and <italic>data-efficient learning</i>, all of which yield superior performance across prevailing benchmarks.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"47 11\",\"pages\":\"10619-10632\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2025-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11106768/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11106768/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Decouple Before Align: Visual Disentanglement Enhances Prompt Tuning
Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT: the visual modality typically conveys more context than the object-oriented textual modality. Consequently, coarsely aligning these two modalities can result in biased attention, driving the model to focus merely on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we explicitly decouple the visual modality into foreground and background representations by exploiting coarse-and-fine visual segmentation cues; both decoupled patterns are then aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance visual concentration, we propose a visual pull-push regularization tailored to the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of the architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks.
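To make the decouple-before-align idea and the pull-push regularization more concrete, the sketch below illustrates one plausible reading of the abstract in PyTorch-style code. It is not the authors' implementation: the helper names (masked_pool, contrastive_align, pull_push_reg), tensor shapes, the cosine-similarity formulation, the margin, and the loss weighting are all assumptions made purely for illustration.

```python
# Minimal sketch (NOT the authors' DAPT implementation) of decouple-before-align:
# foreground/background visual features are pooled from segmentation-derived cues,
# each is aligned with its own text prompts, and a pull-push term keeps the
# original visual feature close to the foreground and away from the background.
import torch
import torch.nn.functional as F


def masked_pool(patch_feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average patch features weighted by a (soft) foreground/background mask.
    patch_feats: (B, N, D) patch embeddings from the visual encoder.
    mask:        (B, N) weights in [0, 1], e.g. from coarse/fine segmentation cues.
    """
    w = mask.unsqueeze(-1)
    return (patch_feats * w).sum(dim=1) / w.sum(dim=1).clamp(min=1e-6)


def contrastive_align(img: torch.Tensor, txt: torch.Tensor,
                      labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Standard image-to-text contrastive (cross-entropy) alignment."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau
    return F.cross_entropy(logits, labels)


def pull_push_reg(global_feat: torch.Tensor, fg_feat: torch.Tensor,
                  bg_feat: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Pull the original (global) visual feature toward its foreground
    counterpart and push it away from the background one."""
    pull = 1.0 - F.cosine_similarity(global_feat, fg_feat, dim=-1)
    push = F.relu(F.cosine_similarity(global_feat, bg_feat, dim=-1) - margin)
    return (pull + push).mean()


# Toy usage with random tensors standing in for encoder outputs.
B, N, D, C_fg, C_bg = 4, 196, 512, 10, 3
patches = torch.randn(B, N, D)          # visual patch embeddings
global_feat = patches.mean(dim=1)       # original (undecoupled) image feature
fg_mask = torch.rand(B, N)              # segmentation-derived foreground cue
fg_feat = masked_pool(patches, fg_mask)
bg_feat = masked_pool(patches, 1.0 - fg_mask)
fg_text = torch.randn(C_fg, D)          # prompt embeddings of foreground classes
bg_text = torch.randn(C_bg, D)          # prompt embeddings of hand-crafted background classes
y_fg = torch.randint(0, C_fg, (B,))
y_bg = torch.randint(0, C_bg, (B,))

loss = (contrastive_align(fg_feat, fg_text, y_fg)
        + contrastive_align(bg_feat, bg_text, y_bg)
        + 0.5 * pull_push_reg(global_feat, fg_feat, bg_feat))
```

In this reading, the two contrastive terms realize the symmetric foreground/background alignment, while pull_push_reg steers the original visual representation toward the region-of-interest object; how the segmentation cues and background class names are actually obtained is left unspecified here, as the abstract does not detail them.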