Multimodal prompt learning with selective feature fusion: towards robust cross-modal alignment

IF 3.5 | Zone 2, Computer Science | Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Intelligence | Pub Date: 2025-10-03 | DOI: 10.1007/s10489-025-06919-y

Jiabao Han, Yahui Wang, Wei Zhong, Ying Zhang, Xichao Yuan

Volume 55, issue 15 | Article page: https://link.springer.com/article/10.1007/s10489-025-06919-y

Citations: 0

Abstract

Vision–language models (VLMs) have shown impressive transferability but still struggle with robustness and generalization when applied to downstream tasks with limited supervision. To address these challenges, we propose a Selective Feature Fusion (SFF) framework that adaptively suppresses noisy visual regions and reinforces task-relevant cross-modal cues through lightweight, learnable gating. Our approach integrates text-guided visual masking and image-aware textual calibration into a unified pipeline, enabling more discriminative and semantically aligned multimodal representations. Comprehensive evaluations across nine widely used benchmarks demonstrate that our method consistently surpasses strong prompt-learning baselines under both few-shot and base-to-novel generalization settings. In particular, under the 8-shot scenario, our approach achieves the best overall accuracy, maintaining a clear margin over representative methods such as CoCoOp and MaPLe. These results highlight not only the robustness of our design but also its effectiveness in capturing cross-modal semantics under data-limited conditions. Further analyses, including ablation studies and qualitative visualizations, confirm that the proposed gating and calibration modules are complementary and play indispensable roles in improving performance. Taken together, this work provides a simple yet powerful strategy for enhancing the adaptability and generalization of VLMs in real-world scenarios.
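The abstract names two components, text-guided visual masking and image-aware textual calibration, without giving their equations. The sketch below is only an illustration of how such gating could look. It is not the authors' implementation. The function names, the sigmoid-of-cosine gate, the convex-mix calibration, and the `scale` and `bias` values (stand-ins for learnable gate parameters) are all assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Fall back to 1.0 so an all-zero vector does not divide by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def text_guided_mask(patches, text, scale=5.0, bias=-2.5):
    """Hypothetical text-guided visual masking.

    Each patch feature gets a gate in (0, 1) computed from its cosine
    similarity to the text feature. The pooled image feature is the
    gate-weighted mean of the raw patches, so patches that disagree
    with the text contribute less.
    """
    t = normalize(text)
    gates = [sigmoid(scale * dot(normalize(p), t) + bias) for p in patches]
    total = sum(gates)  # strictly positive because every gate is > 0
    pooled = [
        sum(g * p[k] for g, p in zip(gates, patches)) / total
        for k in range(len(text))
    ]
    return gates, pooled


def image_aware_calibration(text, image, scale=5.0, bias=-2.0):
    """Hypothetical image-aware textual calibration.

    A scalar gate decides how far the unit-norm text feature is pulled
    toward the unit-norm image feature; the result is re-normalized.
    """
    t, v = normalize(text), normalize(image)
    g = sigmoid(scale * dot(t, v) + bias)
    mixed = [(1.0 - g) * a + g * b for a, b in zip(t, v)]
    return normalize(mixed)


# Toy usage: the second patch is orthogonal to the text feature.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
text = [1.0, 0.0]
gates, pooled = text_guided_mask(patches, text)
calibrated = image_aware_calibration(text, pooled)
```

In this toy setup the patch orthogonal to the text receives the smallest gate, so the pooled image feature leans toward the text-aligned patches. In a trained model the gate parameters would be learned together with the prompts rather than fixed by hand.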

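Source journal

Applied Intelligence (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 6.60
Self-citation rate: 20.80%
Articles published: 1361
Review time: 5.9 months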

Journal description: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.