{"title":"用于视觉语言提示调优的视觉残差聚合网络","authors":"Yunqian Yu, Feng Guo, Xianlong Tian, Biao Chen, Mengmeng Jing, Lin Zuo","doi":"10.1007/s10489-025-06866-8","DOIUrl":null,"url":null,"abstract":"<div><p>Prompt tuning leverages a series of learnable prompts to effectively guide pre-trained visual language models (VLMs) to adapt to various downstream tasks. VLMs encode deep features from both visual and textual branches and learn the joint embedding space of the two modalities by optimizing the contrast loss. However, existing prompt tuning methods face two critical challenges: (1) One challenge is the forgetting of generalized knowledge. As features propagate through the visual encoder, generalizable knowledge captured in shallow layers is gradually lost, ultimately impairing the generalization ability of the joint embedding space for new classes. (2) The other challenge is that models trained on the base class suffer from semantic bias. To address these issues, we propose <b><u>V</u></b>isual <b><u>R</u></b>esidual <b><u>A</u></b>ggregation Network for Visual-Language <b><u>P</u></b>rompt <b><u>T</u></b>uning (VraPT). VraPT comprises two sequentially connected components: a residual aggregation module and a semantic consistency module. Firstly, in order to solve the problem of generalized knowledge forgetting, the residual aggregation module enables adaptive fusion of generalized features, which effectively preserves generalized knowledge. It also reveals the importance of shallow features in enhancing the generalization capability of text prompts. The fused representation is then fed into the semantic consistency module which is used to address the problem of semantic bias. By minimizing the divergence from the true semantic distribution, this module enhances the semantic representations in the visual space as well as the semantic coherence of the learnable prompts. 
Our method enables the learned prompts to retain both discriminative semantic information and generalized knowledge. Extensive experiments show that our proposed VraPT is an effective prompt tuning method, especially in recognizing new classes with great improvement. On average, VraPT improves the accuracy on base classes by 1.06% and on new classes by 2.63% across 11 datasets, along with a 1.91% gain in the harmonic mean (H) metric.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 15","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Visual residual aggregation network for visual-language prompt tuning\",\"authors\":\"Yunqian Yu, Feng Guo, Xianlong Tian, Biao Chen, Mengmeng Jing, Lin Zuo\",\"doi\":\"10.1007/s10489-025-06866-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Prompt tuning leverages a series of learnable prompts to effectively guide pre-trained visual language models (VLMs) to adapt to various downstream tasks. VLMs encode deep features from both visual and textual branches and learn the joint embedding space of the two modalities by optimizing the contrast loss. However, existing prompt tuning methods face two critical challenges: (1) One challenge is the forgetting of generalized knowledge. As features propagate through the visual encoder, generalizable knowledge captured in shallow layers is gradually lost, ultimately impairing the generalization ability of the joint embedding space for new classes. (2) The other challenge is that models trained on the base class suffer from semantic bias. To address these issues, we propose <b><u>V</u></b>isual <b><u>R</u></b>esidual <b><u>A</u></b>ggregation Network for Visual-Language <b><u>P</u></b>rompt <b><u>T</u></b>uning (VraPT). 
VraPT comprises two sequentially connected components: a residual aggregation module and a semantic consistency module. Firstly, in order to solve the problem of generalized knowledge forgetting, the residual aggregation module enables adaptive fusion of generalized features, which effectively preserves generalized knowledge. It also reveals the importance of shallow features in enhancing the generalization capability of text prompts. The fused representation is then fed into the semantic consistency module which is used to address the problem of semantic bias. By minimizing the divergence from the true semantic distribution, this module enhances the semantic representations in the visual space as well as the semantic coherence of the learnable prompts. Our method enables the learned prompts to retain both discriminative semantic information and generalized knowledge. Extensive experiments show that our proposed VraPT is an effective prompt tuning method, especially in recognizing new classes with great improvement. 
On average, VraPT improves the accuracy on base classes by 1.06% and on new classes by 2.63% across 11 datasets, along with a 1.91% gain in the harmonic mean (H) metric.</p></div>\",\"PeriodicalId\":8041,\"journal\":{\"name\":\"Applied Intelligence\",\"volume\":\"55 15\",\"pages\":\"\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2025-09-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10489-025-06866-8\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-025-06866-8","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Visual residual aggregation network for visual-language prompt tuning
Prompt tuning uses a series of learnable prompts to guide pre-trained vision-language models (VLMs) to adapt to various downstream tasks. VLMs encode deep features in both the visual and textual branches and learn a joint embedding space for the two modalities by optimizing a contrastive loss. Existing prompt tuning methods, however, face two critical challenges. The first is the forgetting of generalized knowledge: as features propagate through the visual encoder, the generalizable knowledge captured in shallow layers is gradually lost, ultimately impairing the generalization of the joint embedding space to new classes. The second is that models trained only on base classes suffer from semantic bias. To address these issues, we propose the Visual Residual Aggregation Network for Visual-Language Prompt Tuning (VraPT). VraPT comprises two sequentially connected components: a residual aggregation module and a semantic consistency module. To counter the forgetting of generalized knowledge, the residual aggregation module adaptively fuses generalizable shallow features into the deep representation, effectively preserving generalized knowledge; it also highlights the importance of shallow features for improving the generalization of text prompts. The fused representation is then fed into the semantic consistency module, which addresses semantic bias: by minimizing the divergence from the true semantic distribution, it strengthens the semantic representations in the visual space as well as the semantic coherence of the learnable prompts. Our method enables the learned prompts to retain both discriminative semantic information and generalized knowledge. Extensive experiments show that VraPT is an effective prompt tuning method, with especially large improvements when recognizing new classes. On average across 11 datasets, VraPT improves accuracy by 1.06% on base classes and 2.63% on new classes, along with a 1.91% gain in the harmonic mean (H) metric.
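The two modules described in the abstract can be sketched compactly. The following is a minimal, hypothetical PyTorch rendering of the ideas only, not the authors' implementation: the per-layer scalar gating, the linear projection, and the choice of reference semantic distribution (e.g. the frozen zero-shot model's predictions) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualAggregation(nn.Module):
    """Adaptively fuses shallow-layer visual features into the deep
    representation so generalizable shallow knowledge is preserved.
    The scalar per-layer gates and projection are assumed details."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # One learnable gate per encoder layer; softmax makes them a
        # convex combination over layers.
        self.gates = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(dim, dim)

    def forward(self, layer_feats: list) -> torch.Tensor:
        # layer_feats: per-layer [batch, dim] features from the visual
        # encoder, shallow first, deepest last.
        weights = torch.softmax(self.gates, dim=0)
        fused = sum(w * f for w, f in zip(weights, layer_feats))
        # Residual connection: deep features plus the adaptive aggregate.
        return layer_feats[-1] + self.proj(fused)


def semantic_consistency_loss(logits: torch.Tensor,
                              target_dist: torch.Tensor) -> torch.Tensor:
    """KL divergence between the prompted model's class distribution and
    a reference semantic distribution (assumed to approximate the 'true'
    semantic distribution mentioned in the abstract)."""
    log_p = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_p, target_dist, reduction="batchmean")
```

In this reading, the residual term keeps the deep features intact while the gated aggregate injects shallow, more generalizable information, and the KL term pulls the prompt-conditioned predictions toward the reference distribution to curb base-class bias.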
Journal description:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.