Visual residual aggregation network for visual-language prompt tuning

IF 3.5 · Zone 2, Computer Science · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Yunqian Yu, Feng Guo, Xianlong Tian, Biao Chen, Mengmeng Jing, Lin Zuo
Applied Intelligence, vol. 55, no. 15. Journal article, published 2025-09-29.
DOI: 10.1007/s10489-025-06866-8
https://link.springer.com/article/10.1007/s10489-025-06866-8
Citations: 0

Abstract

Visual residual aggregation network for visual-language prompt tuning

Prompt tuning uses a set of learnable prompts to adapt pre-trained vision-language models (VLMs) to various downstream tasks. VLMs encode deep features in separate visual and textual branches and learn a joint embedding space for the two modalities by optimizing a contrastive loss. However, existing prompt tuning methods face two critical challenges. (1) Forgetting of generalized knowledge: as features propagate through the visual encoder, generalizable knowledge captured in shallow layers is gradually lost, which ultimately impairs the ability of the joint embedding space to generalize to new classes. (2) Semantic bias: models trained on the base classes suffer from semantic bias. To address these issues, we propose the Visual Residual Aggregation Network for Visual-Language Prompt Tuning (VraPT). VraPT comprises two sequentially connected components: a residual aggregation module and a semantic consistency module. First, to counter the forgetting of generalized knowledge, the residual aggregation module adaptively fuses generalized features, which effectively preserves generalized knowledge. It also reveals the importance of shallow features in enhancing the generalization capability of text prompts. The fused representation is then fed into the semantic consistency module, which addresses semantic bias. By minimizing the divergence from the true semantic distribution, this module enhances both the semantic representations in the visual space and the semantic coherence of the learnable prompts. Our method enables the learned prompts to retain both discriminative semantic information and generalized knowledge. Extensive experiments show that VraPT is an effective prompt tuning method, with particularly large gains in recognizing new classes. On average across 11 datasets, VraPT improves accuracy on base classes by 1.06% and on new classes by 2.63%, with a 1.91% gain in the harmonic mean (H) metric.
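
The abstract reports a harmonic mean (H) alongside base-class and new-class accuracy but does not define it. A minimal sketch follows, assuming the definition commonly used in base-to-new generalization benchmarks, H = 2 * base * new / (base + new). The function name and the sample accuracies are hypothetical and are not taken from the paper.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of base-class and new-class accuracy.

    Assumed definition (not stated in the abstract):
        H = 2 * base * new / (base + new)
    Both inputs are expected on the same scale, e.g. percent.
    """
    if base_acc < 0 or new_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = base_acc + new_acc
    if total == 0:
        # Both accuracies are zero; define H as zero to avoid dividing by zero.
        return 0.0
    return 2.0 * base_acc * new_acc / total


# Illustrative values only; these are NOT results from the paper.
base_acc, new_acc = 80.0, 60.0
print(f"H = {harmonic_mean(base_acc, new_acc):.2f}")  # prints "H = 68.57"
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method scores well on H only when it does well on base and new classes at the same time. For the same reason, an average gain in H cannot be recovered from the average base and new gains alone; it depends on the per-dataset accuracies.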

Source journal
Applied Intelligence (Engineering & Technology: Computer Science, Artificial Intelligence)
CiteScore: 6.60
Self-citation rate: 20.80%
Articles published: 1361
Review time: 5.9 months
About the journal: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.