TMCN: Text-guided Mamba-CNN dual-encoder network for infrared and visible image fusion
Jianming Zhang, Xiangnan Shi, Zhijian Feng, Yan Gui, Jin Wang
Infrared Physics & Technology, Volume 149, Article 105895 (published 2025-05-10)
DOI: 10.1016/j.infrared.2025.105895
Abstract
Infrared and visible image fusion (IVF) combines the complementary advantages of two images captured by different physical imaging modalities to create a new image with richer information. To better address issues such as weak texture details, low contrast, and poor visual perception in overexposed and underexposed areas, we propose a text-guided Mamba-CNN dual-encoder network (TMCN). First, to leverage the feature extraction capabilities of Mamba and CNN, we design a pre-training network that trains a Mamba-based encoder, a CNN-based encoder, and a decoder; the structures of these encoders are reused in the image fusion stage. We then introduce a hybrid Mamba-CNN dual-encoder to extract global and local features from the infrared and visible images, yielding four distinct types of feature information. Second, we design a global fusion block (GFB) built on the Mamba-based encoder and a local fusion block (LFB) built on the CNN-based encoder to fuse the global and local features of the two modalities, respectively. After these fusion blocks, we introduce text semantic information and exploit its stable, targeted nature to better address the problems above. To this end, we propose a plug-and-play text-guided block (TB) that first encodes the input text with a CLIP-based text encoder and then uses a feed-forward neural network (FFN) to produce two parameters for a subsequent linear transformation, realizing the text-guided mechanism. Finally, extensive experiments demonstrate that our method achieves excellent performance in IVF and generalizes well. Furthermore, it improves the performance of downstream tasks such as object detection and semantic segmentation. The code will be available at https://github.com/XiangnanShi-CSUST/TMCN.
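The abstract describes the text-guided block (TB) only at a high level: a CLIP-based text encoder produces a text embedding, an FFN maps it to two parameters, and those parameters drive a linear transformation of the fused image features. The following is a minimal, hypothetical sketch of such a mechanism, not the authors' implementation; the module name, dimensions, and the exact scale-and-shift form of the linear transformation are assumptions made for illustration.

import torch
import torch.nn as nn

class TextGuidedBlock(nn.Module):
    """Illustrative text-guided block: an FFN maps a text embedding to a
    per-channel scale (gamma) and shift (beta) that modulate fused features."""

    def __init__(self, text_dim: int = 512, feat_channels: int = 64):
        super().__init__()
        # FFN producing 2 * feat_channels values: gamma and beta per channel.
        self.ffn = nn.Sequential(
            nn.Linear(text_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, 2 * feat_channels),
        )

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) fused image features (e.g., from GFB or LFB)
        # text_emb: (B, text_dim) embedding, assumed to come from a CLIP text encoder
        gamma, beta = self.ffn(text_emb).chunk(2, dim=-1)   # each (B, C)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)           # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + gamma) + beta                    # linear transformation of features

# Shape-only usage example; a real text embedding would be produced by CLIP.
tb = TextGuidedBlock(text_dim=512, feat_channels=64)
fused = torch.randn(1, 64, 128, 128)
text = torch.randn(1, 512)
out = tb(fused, text)   # same shape as fused: (1, 64, 128, 128)

Because the block only rescales and shifts existing feature channels, it can in principle be inserted after any fusion stage without changing feature shapes, which is consistent with the "plug-and-play" claim in the abstract.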
Journal introduction:
The Journal covers the entire field of infrared physics and technology: theory, experiment, application, devices and instrumentation. "Infrared" is defined as covering the near-, mid- and far-infrared (terahertz) regions, from 0.75 µm (750 nm) to 1 mm (300 GHz). Submissions in the 300 GHz to 100 GHz region may be accepted at the editors' discretion if their content is relevant to shorter wavelengths. Submissions must be primarily concerned with and directly relevant to this spectral region.
Its core topics can be summarized as the generation, propagation and detection of infrared radiation; the associated optics, materials and devices; and its use in all fields of science, industry, engineering and medicine.
Infrared techniques occur in many different fields, notably spectroscopy and interferometry; material characterization and processing; atmospheric physics, astronomy and space research. Scientific aspects include lasers, quantum optics, quantum electronics, image processing and semiconductor physics. Some important applications are medical diagnostics and treatment, industrial inspection and environmental monitoring.