Frequency-Based Comprehensive Prompt Learning for Vision-Language Models

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6)
Liangchen Liu, Nannan Wang, Chen Chen, Decheng Liu, Xi Yang, Xinbo Gao, Tongliang Liu
{"title":"基于频率的视觉语言模型综合提示学习。","authors":"Liangchen Liu, Nannan Wang, Chen Chen, Decheng Liu, Xi Yang, Xinbo Gao, Tongliang Liu","doi":"10.1109/TPAMI.2025.3599830","DOIUrl":null,"url":null,"abstract":"<p><p>This paper targets to learn multiple comprehensive text prompts that can describe the visual concepts from coarse to fine, thereby endowing pre-trained VLMs with better transfer ability to various downstream tasks. We focus on exploring this idea on transformer-based VLMs since this kind of architecture achieves more compelling performances than CNN-based ones. Unfortunately, unlike CNNs, the transformer-based visual encoder of pre-trained VLMs cannot naturally provide discriminative and representative local visual information. To solve this problem, we propose Frequency-based Comprehensive Prompt Learning (FCPrompt) to excavate representative local visual information from the redundant output features of the visual encoder. FCPrompt transforms these features into frequency domain via Discrete Cosine Transform (DCT). Taking the advantages of energy concentration and information orthogonality of DCT, we can obtain compact, informative and disentangled local visual information by leveraging specific frequency components of the transformed frequency features. To better fit with transformer architectures, FCPrompt further adopts and optimizes different text prompts to respectively align with the global and frequency-based local visual information via a dual-branch framework. Finally, the learned text prompts can thus describe the entire visual concepts from coarse to fine comprehensively. Extensive experiments indicate that FCPrompt achieves the state-of-the-art performances on various benchmarks. Code is available at https://github.com/llcllc1997/FCPrompt.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Frequency-Based Comprehensive Prompt Learning for Vision-Language Models.\",\"authors\":\"Liangchen Liu, Nannan Wang, Chen Chen, Decheng Liu, Xi Yang, Xinbo Gao, Tongliang Liu\",\"doi\":\"10.1109/TPAMI.2025.3599830\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>This paper targets to learn multiple comprehensive text prompts that can describe the visual concepts from coarse to fine, thereby endowing pre-trained VLMs with better transfer ability to various downstream tasks. We focus on exploring this idea on transformer-based VLMs since this kind of architecture achieves more compelling performances than CNN-based ones. Unfortunately, unlike CNNs, the transformer-based visual encoder of pre-trained VLMs cannot naturally provide discriminative and representative local visual information. To solve this problem, we propose Frequency-based Comprehensive Prompt Learning (FCPrompt) to excavate representative local visual information from the redundant output features of the visual encoder. FCPrompt transforms these features into frequency domain via Discrete Cosine Transform (DCT). Taking the advantages of energy concentration and information orthogonality of DCT, we can obtain compact, informative and disentangled local visual information by leveraging specific frequency components of the transformed frequency features. 
To better fit with transformer architectures, FCPrompt further adopts and optimizes different text prompts to respectively align with the global and frequency-based local visual information via a dual-branch framework. Finally, the learned text prompts can thus describe the entire visual concepts from coarse to fine comprehensively. Extensive experiments indicate that FCPrompt achieves the state-of-the-art performances on various benchmarks. Code is available at https://github.com/llcllc1997/FCPrompt.</p>\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"PP \",\"pages\":\"\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2025-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TPAMI.2025.3599830\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2025.3599830","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract


This paper aims to learn multiple comprehensive text prompts that describe visual concepts from coarse to fine, thereby endowing pre-trained VLMs with better transfer ability to various downstream tasks. We focus on exploring this idea on transformer-based VLMs, since this kind of architecture achieves more compelling performance than CNN-based ones. Unfortunately, unlike CNNs, the transformer-based visual encoder of pre-trained VLMs cannot naturally provide discriminative and representative local visual information. To solve this problem, we propose Frequency-based Comprehensive Prompt Learning (FCPrompt) to excavate representative local visual information from the redundant output features of the visual encoder. FCPrompt transforms these features into the frequency domain via the Discrete Cosine Transform (DCT). Taking advantage of the energy concentration and information orthogonality of the DCT, we can obtain compact, informative, and disentangled local visual information by leveraging specific frequency components of the transformed features. To better fit transformer architectures, FCPrompt further adopts and optimizes different text prompts to align with the global and frequency-based local visual information, respectively, via a dual-branch framework. The learned text prompts can thus describe the entire visual concept comprehensively, from coarse to fine. Extensive experiments indicate that FCPrompt achieves state-of-the-art performance on various benchmarks. Code is available at https://github.com/llcllc1997/FCPrompt.
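To make the mechanism described above concrete, below is a minimal, illustrative Python sketch of the frequency-based idea. It is not the authors' implementation (see the linked repository for that): the tensor shapes, the number of retained frequency components K, the branch-fusion rule, and all variable names are assumptions made purely for illustration.

```python
# Illustrative sketch of frequency-based local feature extraction and
# dual-branch alignment, as outlined in the abstract. All shapes, the
# choice of frequency band, and the fusion rule are hypothetical.

import numpy as np
from scipy.fft import dct

# Suppose a transformer visual encoder outputs one global [CLS] token and
# N patch tokens, each of dimension D (e.g. a ViT-B/16 CLIP encoder).
N, D = 196, 512
rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((N, D)).astype(np.float32)  # (N, D)
cls_token = rng.standard_normal(D).astype(np.float32)          # (D,)

# Step 1: type-II DCT along the token axis. The DCT concentrates most of
# the signal energy in its leading (low-frequency) coefficients, so the
# N redundant patch tokens compress into a few informative components.
freq = dct(patch_tokens, type=2, norm="ortho", axis=0)  # (N, D)

# Step 2: keep a small set of frequency components as compact local
# descriptors. How many to keep, and from which band, is a design
# choice; here we simply take the first K coefficients after the DC term.
# Orthogonality of the DCT basis keeps these descriptors decorrelated.
K = 4
local_feats = freq[1 : 1 + K]  # (K, D)

# Step 3 (dual-branch alignment, also sketched): per-class text
# embeddings from one global prompt and K local prompts are compared
# against the matching visual features, and the two branches are fused.
def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

C = 10  # number of classes (hypothetical)
global_text = rng.standard_normal((C, D)).astype(np.float32)    # global-prompt branch
local_text = rng.standard_normal((K, C, D)).astype(np.float32)  # one prompt set per component

global_logits = cosine(cls_token[None], global_text)[0]         # (C,)
local_logits = np.stack(
    [cosine(local_feats[k][None], local_text[k])[0] for k in range(K)]
).mean(axis=0)                                                  # (C,)

logits = global_logits + local_logits  # fusion rule is another assumption
print(logits.shape)  # (10,)
```

The point of the sketch is only that the DCT's orthogonal basis turns N redundant patch tokens into a few compact, decorrelated local descriptors, each of which can be aligned with its own learned text prompt while the [CLS] token is aligned with a separate global prompt; the retained components and the fusion of the two branches would be tuned per task.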
