Decouple Before Align: Visual Disentanglement Enhances Prompt Tuning

IF 18.6
Fei Zhang;Tianfei Zhou;Jiangchao Yao;Ya Zhang;Ivor W. Tsang;Yanfeng Wang
{"title":"在对齐之前解耦:视觉解耦增强了提示调谐","authors":"Fei Zhang;Tianfei Zhou;Jiangchao Yao;Ya Zhang;Ivor W. Tsang;Yanfeng Wang","doi":"10.1109/TPAMI.2025.3594894","DOIUrl":null,"url":null,"abstract":"<italic>P</i>rompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of <italic>vision-language models</i>. This paper delves into a previously overlooked <italic>information asymmetry</i> issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in the <italic>biased attention</i>, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive <italic>decouple-before-align</i> concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation via exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the <italic>region-of-interest</i> object. We demonstrate the power of architecture-free DAPT through <italic>few-shot learning</i>, <italic>base-to-novel generalization</i>, and <italic>data-efficient learning</i>, all of which yield superior performance across prevailing benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 11","pages":"10619-10632"},"PeriodicalIF":18.6000,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Decouple Before Align: Visual Disentanglement Enhances Prompt Tuning\",\"authors\":\"Fei Zhang;Tianfei Zhou;Jiangchao Yao;Ya Zhang;Ivor W. Tsang;Yanfeng Wang\",\"doi\":\"10.1109/TPAMI.2025.3594894\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<italic>P</i>rompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of <italic>vision-language models</i>. This paper delves into a previously overlooked <italic>information asymmetry</i> issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in the <italic>biased attention</i>, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive <italic>decouple-before-align</i> concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation via exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. 
To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the <italic>region-of-interest</i> object. We demonstrate the power of architecture-free DAPT through <italic>few-shot learning</i>, <italic>base-to-novel generalization</i>, and <italic>data-efficient learning</i>, all of which yield superior performance across prevailing benchmarks.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"47 11\",\"pages\":\"10619-10632\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2025-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11106768/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11106768/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has shown remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information-asymmetry issue in PT: the visual modality mostly conveys more context than the object-oriented textual modality. Consequently, coarsely aligning these two modalities can result in biased attention, driving the model to focus merely on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we explicitly decouple the visual modality into foreground and background representations by exploiting coarse-and-fine visual segmentation cues; both decoupled patterns are then aligned with the original foreground texts and hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks.
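
The abstract does not include implementation details, so the following is a minimal PyTorch sketch of the decouple-before-align idea under stated assumptions: a precomputed soft foreground mask stands in for the coarse-and-fine segmentation cues, cross-entropy over cosine similarities stands in for the modal alignment, and a triplet-style term stands in for the visual pull-push regularization. Function names, shapes, and hyperparameters (decouple_patches, align_loss, pull_push_loss, tau, margin) are illustrative and not taken from the paper.

```python
# Illustrative sketch (not the authors' code): decouple-before-align losses for a
# CLIP-like prompt-tuning setup. Patch tokens are split into foreground/background
# features by a precomputed soft mask, each pooled feature is aligned with its own
# set of text prototypes, and a pull-push term nudges the global image feature
# toward the foreground and away from the background.
import torch
import torch.nn.functional as F


def decouple_patches(patch_tokens: torch.Tensor, fg_mask: torch.Tensor):
    """Mask-weighted pooling of patch tokens into foreground/background features.

    patch_tokens: (B, N, D) visual patch embeddings
    fg_mask:      (B, N) soft foreground mask in [0, 1], e.g. derived from
                  coarse/fine segmentation cues (assumed to be given here)
    """
    fg_w = fg_mask.unsqueeze(-1)                       # (B, N, 1)
    bg_w = 1.0 - fg_w
    fg_feat = (patch_tokens * fg_w).sum(1) / fg_w.sum(1).clamp_min(1e-6)
    bg_feat = (patch_tokens * bg_w).sum(1) / bg_w.sum(1).clamp_min(1e-6)
    return fg_feat, bg_feat


def align_loss(visual_feat, text_protos, labels, tau: float = 0.07):
    """Contrastive alignment of a pooled visual feature with text prototypes."""
    v = F.normalize(visual_feat, dim=-1)
    t = F.normalize(text_protos, dim=-1)
    logits = v @ t.t() / tau                           # (B, C)
    return F.cross_entropy(logits, labels)


def pull_push_loss(global_feat, fg_feat, bg_feat, margin: float = 0.2):
    """Pull the global image feature toward the foreground feature and push it
    away from the background feature (a triplet-style surrogate for the paper's
    pull-push regularization)."""
    g = F.normalize(global_feat, dim=-1)
    f = F.normalize(fg_feat, dim=-1)
    b = F.normalize(bg_feat, dim=-1)
    pull = 1.0 - (g * f).sum(-1)                       # cosine distance to foreground
    push = (g * b).sum(-1)                             # cosine similarity to background
    return F.relu(pull + push + margin).mean()


if __name__ == "__main__":
    # Random tensors as stand-ins for encoder outputs and prompt-tuned text features.
    B, N, D, C_fg, C_bg = 4, 196, 512, 10, 3
    patch_tokens = torch.randn(B, N, D)
    global_feat = torch.randn(B, D)
    fg_mask = torch.rand(B, N)
    fg_text = torch.randn(C_fg, D)                     # foreground class texts
    bg_text = torch.randn(C_bg, D)                     # hand-crafted background classes
    y_fg = torch.randint(0, C_fg, (B,))
    y_bg = torch.randint(0, C_bg, (B,))

    fg_feat, bg_feat = decouple_patches(patch_tokens, fg_mask)
    loss = (align_loss(fg_feat, fg_text, y_fg)
            + align_loss(bg_feat, bg_text, y_bg)
            + pull_push_loss(global_feat, fg_feat, bg_feat))
    print(float(loss))
```

In this reading, the two align_loss terms realize the symmetric foreground/background alignment, while pull_push_loss regularizes the global representation toward the region-of-interest object; how the actual DAPT method weights or formulates these terms is specified only in the full paper.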