Frequency-Based Comprehensive Prompt Learning for Vision-Language Models

Liangchen Liu, Nannan Wang, Chen Chen, Decheng Liu, Xi Yang, Xinbo Gao, Tongliang Liu

IEEE Transactions on Pattern Analysis and Machine Intelligence, published online 2025-08-19. DOI: 10.1109/TPAMI.2025.3599830
This paper aims to learn multiple comprehensive text prompts that describe visual concepts from coarse to fine, thereby endowing pre-trained vision-language models (VLMs) with better transferability to various downstream tasks. We explore this idea on transformer-based VLMs, since this architecture achieves more compelling performance than CNN-based ones. Unfortunately, unlike CNNs, the transformer-based visual encoder of pre-trained VLMs does not naturally provide discriminative and representative local visual information. To solve this problem, we propose Frequency-based Comprehensive Prompt Learning (FCPrompt), which extracts representative local visual information from the redundant output features of the visual encoder. FCPrompt transforms these features into the frequency domain via the Discrete Cosine Transform (DCT). Exploiting the energy compaction and orthogonality of the DCT, we obtain compact, informative, and disentangled local visual information by selecting specific frequency components of the transformed features. To better fit transformer architectures, FCPrompt further adopts and optimizes distinct text prompts that align, via a dual-branch framework, with the global and the frequency-based local visual information respectively. The learned text prompts can thus describe the entire visual concept comprehensively, from coarse to fine. Extensive experiments indicate that FCPrompt achieves state-of-the-art performance on various benchmarks. Code is available at https://github.com/llcllc1997/FCPrompt.
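To make the frequency-domain step concrete, below is a minimal sketch of the idea, not the authors' released implementation: apply a DCT across a ViT's patch tokens, keep a handful of leading components as compact local descriptors, then score an image against one global prompt and several local prompts. The token shapes, the DCT axis, the number of retained components, and the score-fusion rule are all illustrative assumptions; consult the repository above for the actual method.

```python
# Illustrative sketch of DCT-based local feature extraction and dual-branch
# scoring, under assumed shapes. Random arrays stand in for real encoder
# outputs and learned prompt embeddings.
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 768))  # hypothetical 14x14 ViT patch tokens, dim 768

# DCT-II along the token axis: the orthogonal transform concentrates most of
# the signal energy in the leading coefficients, so a few of them can
# summarize the redundant token set.
freq = dct(tokens, type=2, axis=0, norm="ortho")  # (196, 768)

k = 8                       # assumed number of frequency components to keep
local_feats = freq[:k]      # (k, 768) compact, disentangled local descriptors

# Dual-branch alignment sketch: one text prompt embedding for the global
# feature and k prompt embeddings for the local frequency components
# (all stand-ins here, not learned weights).
global_feat = rng.standard_normal(768)   # stand-in for the global [CLS] feature
global_prompt = rng.standard_normal(768)
local_prompts = rng.standard_normal((k, 768))

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Combine both branches into a single class score (fusion rule assumed).
score = cos(global_feat, global_prompt) + np.mean(
    [cos(f, p) for f, p in zip(local_feats, local_prompts)]
)
print(local_feats.shape, score)
```

The point of the sketch is the division of labor: the global branch matches a coarse, image-level prompt, while each retained DCT component matches its own finer-grained prompt, so the learned prompts jointly cover the visual concept from coarse to fine.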