{"title":"用于视觉语言提示调优的视觉残差聚合网络","authors":"Yunqian Yu, Feng Guo, Xianlong Tian, Biao Chen, Mengmeng Jing, Lin Zuo","doi":"10.1007/s10489-025-06866-8","DOIUrl":null,"url":null,"abstract":"<div><p>Prompt tuning leverages a series of learnable prompts to effectively guide pre-trained visual language models (VLMs) to adapt to various downstream tasks. VLMs encode deep features from both visual and textual branches and learn the joint embedding space of the two modalities by optimizing the contrast loss. However, existing prompt tuning methods face two critical challenges: (1) One challenge is the forgetting of generalized knowledge. As features propagate through the visual encoder, generalizable knowledge captured in shallow layers is gradually lost, ultimately impairing the generalization ability of the joint embedding space for new classes. (2) The other challenge is that models trained on the base class suffer from semantic bias. To address these issues, we propose <b><u>V</u></b>isual <b><u>R</u></b>esidual <b><u>A</u></b>ggregation Network for Visual-Language <b><u>P</u></b>rompt <b><u>T</u></b>uning (VraPT). VraPT comprises two sequentially connected components: a residual aggregation module and a semantic consistency module. Firstly, in order to solve the problem of generalized knowledge forgetting, the residual aggregation module enables adaptive fusion of generalized features, which effectively preserves generalized knowledge. It also reveals the importance of shallow features in enhancing the generalization capability of text prompts. The fused representation is then fed into the semantic consistency module which is used to address the problem of semantic bias. By minimizing the divergence from the true semantic distribution, this module enhances the semantic representations in the visual space as well as the semantic coherence of the learnable prompts. 
Our method enables the learned prompts to retain both discriminative semantic information and generalized knowledge. Extensive experiments show that our proposed VraPT is an effective prompt tuning method, especially in recognizing new classes with great improvement. On average, VraPT improves the accuracy on base classes by 1.06% and on new classes by 2.63% across 11 datasets, along with a 1.91% gain in the harmonic mean (H) metric.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 15","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Visual residual aggregation network for visual-language prompt tuning\",\"authors\":\"Yunqian Yu, Feng Guo, Xianlong Tian, Biao Chen, Mengmeng Jing, Lin Zuo\",\"doi\":\"10.1007/s10489-025-06866-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Prompt tuning leverages a series of learnable prompts to effectively guide pre-trained visual language models (VLMs) to adapt to various downstream tasks. VLMs encode deep features from both visual and textual branches and learn the joint embedding space of the two modalities by optimizing the contrast loss. However, existing prompt tuning methods face two critical challenges: (1) One challenge is the forgetting of generalized knowledge. As features propagate through the visual encoder, generalizable knowledge captured in shallow layers is gradually lost, ultimately impairing the generalization ability of the joint embedding space for new classes. (2) The other challenge is that models trained on the base class suffer from semantic bias. To address these issues, we propose <b><u>V</u></b>isual <b><u>R</u></b>esidual <b><u>A</u></b>ggregation Network for Visual-Language <b><u>P</u></b>rompt <b><u>T</u></b>uning (VraPT). 
VraPT comprises two sequentially connected components: a residual aggregation module and a semantic consistency module. Firstly, in order to solve the problem of generalized knowledge forgetting, the residual aggregation module enables adaptive fusion of generalized features, which effectively preserves generalized knowledge. It also reveals the importance of shallow features in enhancing the generalization capability of text prompts. The fused representation is then fed into the semantic consistency module which is used to address the problem of semantic bias. By minimizing the divergence from the true semantic distribution, this module enhances the semantic representations in the visual space as well as the semantic coherence of the learnable prompts. Our method enables the learned prompts to retain both discriminative semantic information and generalized knowledge. Extensive experiments show that our proposed VraPT is an effective prompt tuning method, especially in recognizing new classes with great improvement. 
On average, VraPT improves the accuracy on base classes by 1.06% and on new classes by 2.63% across 11 datasets, along with a 1.91% gain in the harmonic mean (H) metric.</p></div>\",\"PeriodicalId\":8041,\"journal\":{\"name\":\"Applied Intelligence\",\"volume\":\"55 15\",\"pages\":\"\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2025-09-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10489-025-06866-8\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-025-06866-8","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Visual residual aggregation network for visual-language prompt tuning
Prompt tuning uses a series of learnable prompts to guide pre-trained vision-language models (VLMs) to adapt to various downstream tasks. VLMs encode deep features in both the visual and textual branches and learn a joint embedding space for the two modalities by optimizing a contrastive loss. Existing prompt tuning methods, however, face two critical challenges. The first is the forgetting of generalized knowledge: as features propagate through the visual encoder, the generalizable knowledge captured in shallow layers is gradually lost, ultimately impairing the generalization of the joint embedding space to new classes. The second is that models trained only on base classes suffer from semantic bias. To address these issues, we propose the Visual Residual Aggregation Network for Visual-Language Prompt Tuning (VraPT). VraPT comprises two sequentially connected components: a residual aggregation module and a semantic consistency module. To counter the forgetting of generalized knowledge, the residual aggregation module adaptively fuses generalizable shallow features into the deep representation, effectively preserving generalized knowledge; it also highlights the importance of shallow features for improving the generalization of text prompts. The fused representation is then fed into the semantic consistency module, which addresses semantic bias: by minimizing the divergence from the true semantic distribution, it strengthens the semantic representations in the visual space as well as the semantic coherence of the learnable prompts. Our method enables the learned prompts to retain both discriminative semantic information and generalized knowledge. Extensive experiments show that VraPT is an effective prompt tuning method, with especially large improvements when recognizing new classes. On average across 11 datasets, VraPT improves accuracy by 1.06% on base classes and 2.63% on new classes, along with a 1.91% gain in the harmonic mean (H) metric.
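The two modules described in the abstract can be sketched compactly. The following is a minimal, hypothetical PyTorch rendering of the ideas only, not the authors' implementation: the per-layer scalar gating, the linear projection, and the choice of reference semantic distribution (e.g. the frozen zero-shot model's predictions) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualAggregation(nn.Module):
    """Adaptively fuses shallow-layer visual features into the deep
    representation so generalizable shallow knowledge is preserved.
    The scalar per-layer gates and projection are assumed details."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # One learnable gate per encoder layer; softmax makes them a
        # convex combination over layers.
        self.gates = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(dim, dim)

    def forward(self, layer_feats: list) -> torch.Tensor:
        # layer_feats: per-layer [batch, dim] features from the visual
        # encoder, shallow first, deepest last.
        weights = torch.softmax(self.gates, dim=0)
        fused = sum(w * f for w, f in zip(weights, layer_feats))
        # Residual connection: deep features plus the adaptive aggregate.
        return layer_feats[-1] + self.proj(fused)


def semantic_consistency_loss(logits: torch.Tensor,
                              target_dist: torch.Tensor) -> torch.Tensor:
    """KL divergence between the prompted model's class distribution and
    a reference semantic distribution (assumed to approximate the 'true'
    semantic distribution mentioned in the abstract)."""
    log_p = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_p, target_dist, reduction="batchmean")
```

In this reading, the residual term keeps the deep features intact while the gated aggregate injects shallow, more generalizable information, and the KL term pulls the prompt-conditioned predictions toward the reference distribution to curb base-class bias.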
Journal description:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.