{"title":"红外-可见图像融合的视觉语言模型-先验驱动状态空间模型","authors":"Rongjin Zhuang;Yingying Wang;Xiaotong Tu;Yue Huang;Xinghao Ding","doi":"10.1109/LSP.2025.3578250","DOIUrl":null,"url":null,"abstract":"Infrared and visible image fusion (IVIF) aims to effectively integrate complementary information from both infrared and visible modalities, enabling a more comprehensive understanding of the scene and improving downstream semantic tasks. Recent advancements in Mamba have shown remarkable performance in image fusion, owing to its linear complexity and global receptive fields. However, leveraging Vision-Language Model (VLM) priors to drive Mamba for modality-specific feature extraction and using them as constraints to enhance fusion results has not been fully explored. To address this gap, we introduce VLMPD-Mamba, a Vision-Language Model Priors-Driven Mamba framework for IVIF. Initially, we employ the VLM to adaptively generate modality-specific textual descriptions, which enhance image quality and highlight critical target information. Next, we present Text-Controlled Mamba (TCM), which integrates textual priors from the VLM to facilitate effective modality-specific feature extraction. Furthermore, we design the Cross-modality Fusion Mamba (CFM) to fuse features from different modalities, utilizing VLM priors as constraints to enhance fusion outcomes while preserving salient targets with rich details. In addition, to promote effective cross modality feature interactions, we introduce a novel bi-modal interaction scanning strategy within the CFM. Extensive experiments on various datasets for IVIF, as well as downstream visual tasks, demonstrate the superiority of our approach over state-of-the-art (SOTA) image fusion algorithms.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"2514-2518"},"PeriodicalIF":3.2000,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Vision-Language Model Priors-Driven State Space Model for Infrared-Visible Image Fusion\",\"authors\":\"Rongjin Zhuang;Yingying Wang;Xiaotong Tu;Yue Huang;Xinghao Ding\",\"doi\":\"10.1109/LSP.2025.3578250\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Infrared and visible image fusion (IVIF) aims to effectively integrate complementary information from both infrared and visible modalities, enabling a more comprehensive understanding of the scene and improving downstream semantic tasks. Recent advancements in Mamba have shown remarkable performance in image fusion, owing to its linear complexity and global receptive fields. However, leveraging Vision-Language Model (VLM) priors to drive Mamba for modality-specific feature extraction and using them as constraints to enhance fusion results has not been fully explored. To address this gap, we introduce VLMPD-Mamba, a Vision-Language Model Priors-Driven Mamba framework for IVIF. Initially, we employ the VLM to adaptively generate modality-specific textual descriptions, which enhance image quality and highlight critical target information. Next, we present Text-Controlled Mamba (TCM), which integrates textual priors from the VLM to facilitate effective modality-specific feature extraction. Furthermore, we design the Cross-modality Fusion Mamba (CFM) to fuse features from different modalities, utilizing VLM priors as constraints to enhance fusion outcomes while preserving salient targets with rich details. 
In addition, to promote effective cross modality feature interactions, we introduce a novel bi-modal interaction scanning strategy within the CFM. Extensive experiments on various datasets for IVIF, as well as downstream visual tasks, demonstrate the superiority of our approach over state-of-the-art (SOTA) image fusion algorithms.\",\"PeriodicalId\":13154,\"journal\":{\"name\":\"IEEE Signal Processing Letters\",\"volume\":\"32 \",\"pages\":\"2514-2518\"},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2025-06-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Signal Processing Letters\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11029141/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11029141/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Vision-Language Model Priors-Driven State Space Model for Infrared-Visible Image Fusion
Abstract: Infrared and visible image fusion (IVIF) aims to effectively integrate complementary information from the infrared and visible modalities, enabling a more comprehensive understanding of the scene and improving downstream semantic tasks. Recent Mamba-based methods have shown remarkable performance in image fusion, owing to Mamba's linear complexity and global receptive field. However, leveraging Vision-Language Model (VLM) priors to drive Mamba for modality-specific feature extraction, and using them as constraints to enhance fusion results, has not been fully explored. To address this gap, we introduce VLMPD-Mamba, a Vision-Language Model Priors-Driven Mamba framework for IVIF. First, we employ the VLM to adaptively generate modality-specific textual descriptions, which enhance image quality and highlight critical target information. Next, we present the Text-Controlled Mamba (TCM), which integrates textual priors from the VLM to facilitate effective modality-specific feature extraction. Furthermore, we design the Cross-modality Fusion Mamba (CFM) to fuse features from different modalities, using VLM priors as constraints to enhance fusion outcomes while preserving salient targets with rich details. In addition, to promote effective cross-modality feature interactions, we introduce a novel bi-modal interaction scanning strategy within the CFM. Extensive experiments on various IVIF datasets, as well as on downstream visual tasks, demonstrate the superiority of our approach over state-of-the-art (SOTA) image fusion algorithms.
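The abstract does not spell out how the bi-modal interaction scan inside the CFM is realized. As a minimal sketch of one plausible interpretation, the snippet below interleaves infrared and visible feature tokens so that a Mamba-style sequential scan would alternate between the two modalities. The function name `bimodal_interleave` and the toy tensor shapes are assumptions for illustration only, not the authors' implementation.

```python
# Illustrative sketch, not the authors' code: one possible bi-modal
# interaction scanning order, where infrared and visible feature tokens are
# interleaved token-by-token before a sequential (state-space) scan.
import torch


def bimodal_interleave(ir_feat: torch.Tensor, vis_feat: torch.Tensor) -> torch.Tensor:
    """Flatten two (B, C, H, W) feature maps into token sequences and
    interleave them as [ir_0, vis_0, ir_1, vis_1, ...]."""
    b, c, h, w = ir_feat.shape
    ir_tokens = ir_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
    vis_tokens = vis_feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
    # Stack along a new "modality" axis, then fold it into the sequence axis,
    # yielding an alternating infrared/visible sequence of length 2*H*W.
    mixed = torch.stack((ir_tokens, vis_tokens), dim=2)  # (B, H*W, 2, C)
    return mixed.reshape(b, 2 * h * w, c)


if __name__ == "__main__":
    ir = torch.randn(1, 16, 8, 8)   # toy infrared features (hypothetical sizes)
    vis = torch.randn(1, 16, 8, 8)  # toy visible features
    seq = bimodal_interleave(ir, vis)
    print(seq.shape)  # torch.Size([1, 128, 16]); this sequence would feed a Mamba-style scan
```

The design intuition behind such an ordering is that the recurrent state carries information from one modality's token directly into the next token of the other modality, which is one straightforward way to encourage cross-modality interaction during a linear-time scan.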
Journal description:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language, and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP, and ICIP, and also at several workshops organized by the Signal Processing Society.