Vision-Language Model Priors-Driven State Space Model for Infrared-Visible Image Fusion

Impact Factor: 3.2 · CAS Tier 2 (Engineering & Technology) · JCR Q2, Engineering, Electrical & Electronic
Rongjin Zhuang;Yingying Wang;Xiaotong Tu;Yue Huang;Xinghao Ding
{"title":"Vision-Language Model Priors-Driven State Space Model for Infrared-Visible Image Fusion","authors":"Rongjin Zhuang;Yingying Wang;Xiaotong Tu;Yue Huang;Xinghao Ding","doi":"10.1109/LSP.2025.3578250","DOIUrl":null,"url":null,"abstract":"Infrared and visible image fusion (IVIF) aims to effectively integrate complementary information from both infrared and visible modalities, enabling a more comprehensive understanding of the scene and improving downstream semantic tasks. Recent advancements in Mamba have shown remarkable performance in image fusion, owing to its linear complexity and global receptive fields. However, leveraging Vision-Language Model (VLM) priors to drive Mamba for modality-specific feature extraction and using them as constraints to enhance fusion results has not been fully explored. To address this gap, we introduce VLMPD-Mamba, a Vision-Language Model Priors-Driven Mamba framework for IVIF. Initially, we employ the VLM to adaptively generate modality-specific textual descriptions, which enhance image quality and highlight critical target information. Next, we present Text-Controlled Mamba (TCM), which integrates textual priors from the VLM to facilitate effective modality-specific feature extraction. Furthermore, we design the Cross-modality Fusion Mamba (CFM) to fuse features from different modalities, utilizing VLM priors as constraints to enhance fusion outcomes while preserving salient targets with rich details. In addition, to promote effective cross modality feature interactions, we introduce a novel bi-modal interaction scanning strategy within the CFM. Extensive experiments on various datasets for IVIF, as well as downstream visual tasks, demonstrate the superiority of our approach over state-of-the-art (SOTA) image fusion algorithms.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"2514-2518"},"PeriodicalIF":3.2000,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11029141/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Infrared and visible image fusion (IVIF) aims to effectively integrate complementary information from both infrared and visible modalities, enabling a more comprehensive understanding of the scene and improving downstream semantic tasks. Recent advancements in Mamba have shown remarkable performance in image fusion, owing to its linear complexity and global receptive fields. However, leveraging Vision-Language Model (VLM) priors to drive Mamba for modality-specific feature extraction and using them as constraints to enhance fusion results has not been fully explored. To address this gap, we introduce VLMPD-Mamba, a Vision-Language Model Priors-Driven Mamba framework for IVIF. Initially, we employ the VLM to adaptively generate modality-specific textual descriptions, which enhance image quality and highlight critical target information. Next, we present Text-Controlled Mamba (TCM), which integrates textual priors from the VLM to facilitate effective modality-specific feature extraction. Furthermore, we design the Cross-modality Fusion Mamba (CFM) to fuse features from different modalities, utilizing VLM priors as constraints to enhance fusion outcomes while preserving salient targets with rich details. In addition, to promote effective cross-modality feature interactions, we introduce a novel bi-modal interaction scanning strategy within the CFM. Extensive experiments on various datasets for IVIF, as well as downstream visual tasks, demonstrate the superiority of our approach over state-of-the-art (SOTA) image fusion algorithms.
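The abstract does not give implementation details, but the bi-modal interaction scanning idea can be illustrated with a minimal, hypothetical sketch: tokens from the infrared and visible feature maps are interleaved into a single sequence before a global sequence scan, so each scan step alternates between modalities and cross-modality context accumulates. All module and tensor names below are illustrative assumptions (not from the paper), and a plain GRU stands in for the selective-scan (Mamba) core.

```python
# Minimal sketch (not the authors' code) of a bi-modal interaction scan.
# Infrared and visible tokens are interleaved so a sequence model with a
# global receptive field alternates between modalities at each step.
# A GRU is used here purely as a stand-in for a Mamba/state-space block.
import torch
import torch.nn as nn


class BiModalInteractionScan(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Stand-in for the selective-scan core; any causal sequence model fits the sketch.
        self.seq_model = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        # feat_ir, feat_vis: (B, C, H, W) feature maps from the two modality branches.
        b, c, h, w = feat_ir.shape
        ir = feat_ir.flatten(2).transpose(1, 2)    # (B, HW, C)
        vis = feat_vis.flatten(2).transpose(1, 2)  # (B, HW, C)

        # Interleave tokens: ir_0, vis_0, ir_1, vis_1, ...
        inter = torch.stack((ir, vis), dim=2).reshape(b, 2 * h * w, c)

        scanned, _ = self.seq_model(inter)         # (B, 2HW, C)

        # De-interleave and merge the two modality streams into one fused map.
        scanned = scanned.reshape(b, h * w, 2, c)
        fused = self.proj(scanned.mean(dim=2))     # (B, HW, C)
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    ir = torch.randn(1, 64, 32, 32)
    vis = torch.randn(1, 64, 32, 32)
    print(BiModalInteractionScan(64)(ir, vis).shape)  # torch.Size([1, 64, 32, 32])
```

In the paper's CFM, the GRU stand-in would be a selective state-space (Mamba) block and the VLM textual priors would additionally condition or constrain the fused features; this sketch only shows the interleaved scanning pattern under those assumptions.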
Source Journal
IEEE Signal Processing Letters (Engineering: Electrical & Electronic)
CiteScore: 7.40
Self-citation rate: 12.80%
Articles per year: 339
Review time: 2.8 months
Journal description: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.