Vision-Language Model Priors-Driven State Space Model for Infrared-Visible Image Fusion

Impact Factor: 3.2 · CAS Tier 2 (Engineering & Technology) · JCR Q2, Engineering, Electrical & Electronic
Rongjin Zhuang;Yingying Wang;Xiaotong Tu;Yue Huang;Xinghao Ding
{"title":"Vision-Language Model Priors-Driven State Space Model for Infrared-Visible Image Fusion","authors":"Rongjin Zhuang;Yingying Wang;Xiaotong Tu;Yue Huang;Xinghao Ding","doi":"10.1109/LSP.2025.3578250","DOIUrl":null,"url":null,"abstract":"Infrared and visible image fusion (IVIF) aims to effectively integrate complementary information from both infrared and visible modalities, enabling a more comprehensive understanding of the scene and improving downstream semantic tasks. Recent advancements in Mamba have shown remarkable performance in image fusion, owing to its linear complexity and global receptive fields. However, leveraging Vision-Language Model (VLM) priors to drive Mamba for modality-specific feature extraction and using them as constraints to enhance fusion results has not been fully explored. To address this gap, we introduce VLMPD-Mamba, a Vision-Language Model Priors-Driven Mamba framework for IVIF. Initially, we employ the VLM to adaptively generate modality-specific textual descriptions, which enhance image quality and highlight critical target information. Next, we present Text-Controlled Mamba (TCM), which integrates textual priors from the VLM to facilitate effective modality-specific feature extraction. Furthermore, we design the Cross-modality Fusion Mamba (CFM) to fuse features from different modalities, utilizing VLM priors as constraints to enhance fusion outcomes while preserving salient targets with rich details. In addition, to promote effective cross modality feature interactions, we introduce a novel bi-modal interaction scanning strategy within the CFM. Extensive experiments on various datasets for IVIF, as well as downstream visual tasks, demonstrate the superiority of our approach over state-of-the-art (SOTA) image fusion algorithms.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"2514-2518"},"PeriodicalIF":3.2000,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11029141/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Infrared and visible image fusion (IVIF) aims to effectively integrate complementary information from both infrared and visible modalities, enabling a more comprehensive understanding of the scene and improving downstream semantic tasks. Recent advancements in Mamba have shown remarkable performance in image fusion, owing to its linear complexity and global receptive fields. However, leveraging Vision-Language Model (VLM) priors to drive Mamba for modality-specific feature extraction and using them as constraints to enhance fusion results has not been fully explored. To address this gap, we introduce VLMPD-Mamba, a Vision-Language Model Priors-Driven Mamba framework for IVIF. Initially, we employ the VLM to adaptively generate modality-specific textual descriptions, which enhance image quality and highlight critical target information. Next, we present Text-Controlled Mamba (TCM), which integrates textual priors from the VLM to facilitate effective modality-specific feature extraction. Furthermore, we design the Cross-modality Fusion Mamba (CFM) to fuse features from different modalities, utilizing VLM priors as constraints to enhance fusion outcomes while preserving salient targets with rich details. In addition, to promote effective cross-modality feature interactions, we introduce a novel bi-modal interaction scanning strategy within the CFM. Extensive experiments on various datasets for IVIF, as well as downstream visual tasks, demonstrate the superiority of our approach over state-of-the-art (SOTA) image fusion algorithms.
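The abstract does not give implementation details, but the bi-modal interaction scanning idea can be illustrated with a minimal, hypothetical sketch: tokens from the infrared and visible feature maps are interleaved into a single sequence before a global sequence scan, so each scan step alternates between modalities and cross-modality context accumulates. All module and tensor names below are illustrative assumptions (not from the paper), and a plain GRU stands in for the selective-scan (Mamba) core.

```python
# Minimal sketch (not the authors' code) of a bi-modal interaction scan.
# Infrared and visible tokens are interleaved so a sequence model with a
# global receptive field alternates between modalities at each step.
# A GRU is used here purely as a stand-in for a Mamba/state-space block.
import torch
import torch.nn as nn


class BiModalInteractionScan(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Stand-in for the selective-scan core; any causal sequence model fits the sketch.
        self.seq_model = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        # feat_ir, feat_vis: (B, C, H, W) feature maps from the two modality branches.
        b, c, h, w = feat_ir.shape
        ir = feat_ir.flatten(2).transpose(1, 2)    # (B, HW, C)
        vis = feat_vis.flatten(2).transpose(1, 2)  # (B, HW, C)

        # Interleave tokens: ir_0, vis_0, ir_1, vis_1, ...
        inter = torch.stack((ir, vis), dim=2).reshape(b, 2 * h * w, c)

        scanned, _ = self.seq_model(inter)         # (B, 2HW, C)

        # De-interleave and merge the two modality streams into one fused map.
        scanned = scanned.reshape(b, h * w, 2, c)
        fused = self.proj(scanned.mean(dim=2))     # (B, HW, C)
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    ir = torch.randn(1, 64, 32, 32)
    vis = torch.randn(1, 64, 32, 32)
    print(BiModalInteractionScan(64)(ir, vis).shape)  # torch.Size([1, 64, 32, 32])
```

In the paper's CFM, the GRU stand-in would be a selective state-space (Mamba) block and the VLM textual priors would additionally condition or constrain the fused features; this sketch only shows the interleaved scanning pattern under those assumptions.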
Source Journal
IEEE Signal Processing Letters (Engineering: Electrical & Electronic)
CiteScore: 7.40
Self-citation rate: 12.80%
Articles per year: 339
Review time: 2.8 months
Journal description: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.