TMCN: Text-guided Mamba-CNN dual-encoder network for infrared and visible image fusion
Jianming Zhang, Xiangnan Shi, Zhijian Feng, Yan Gui, Jin Wang
Infrared Physics & Technology, Volume 149, Article 105895 (published 2025-05-10)
DOI: 10.1016/j.infrared.2025.105895
Abstract
Infrared and visible image fusion (IVF) combines the complementary advantages of two images captured by different physical imaging modalities to create a new image with richer information. To better address issues such as weak texture details, low contrast, and poor visual perception in overexposed and underexposed areas, we propose a text-guided Mamba-CNN dual-encoder network (TMCN). First, to leverage the feature extraction capabilities of Mamba and CNN, we design a pre-training network that trains a Mamba-based encoder, a CNN-based encoder, and a decoder; the structures of these encoders are reused in the image fusion stage. We then introduce a hybrid Mamba-CNN dual-encoder to extract global and local features from the infrared and visible images, yielding four distinct types of feature information. Second, we design a global fusion block (GFB) built on the Mamba-based encoder and a local fusion block (LFB) built on the CNN-based encoder to fuse the global and local features of the two modalities, respectively. After these fusion blocks, we introduce text semantic information and exploit its stable, targeted nature to better address the problems above. To this end, we propose a plug-and-play text-guided block (TB) that first encodes the input text with a CLIP-based text encoder and then uses a feed-forward neural network (FFN) to produce two parameters for a subsequent linear transformation, realizing the text-guided mechanism. Finally, extensive experiments demonstrate that our method achieves excellent performance in IVF and generalizes well. Furthermore, it improves the performance of downstream tasks such as object detection and semantic segmentation. The code will be available at https://github.com/XiangnanShi-CSUST/TMCN.
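The abstract describes the text-guided block (TB) only at a high level: a CLIP-based text encoder produces a text embedding, an FFN maps it to two parameters, and those parameters drive a linear transformation of the fused image features. The following is a minimal, hypothetical sketch of such a mechanism, not the authors' implementation; the module name, dimensions, and the exact scale-and-shift form of the linear transformation are assumptions made for illustration.

import torch
import torch.nn as nn

class TextGuidedBlock(nn.Module):
    """Illustrative text-guided block: an FFN maps a text embedding to a
    per-channel scale (gamma) and shift (beta) that modulate fused features."""

    def __init__(self, text_dim: int = 512, feat_channels: int = 64):
        super().__init__()
        # FFN producing 2 * feat_channels values: gamma and beta per channel.
        self.ffn = nn.Sequential(
            nn.Linear(text_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, 2 * feat_channels),
        )

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) fused image features (e.g., from GFB or LFB)
        # text_emb: (B, text_dim) embedding, assumed to come from a CLIP text encoder
        gamma, beta = self.ffn(text_emb).chunk(2, dim=-1)   # each (B, C)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)           # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + gamma) + beta                    # linear transformation of features

# Shape-only usage example; a real text embedding would be produced by CLIP.
tb = TextGuidedBlock(text_dim=512, feat_channels=64)
fused = torch.randn(1, 64, 128, 128)
text = torch.randn(1, 512)
out = tb(fused, text)   # same shape as fused: (1, 64, 128, 128)

Because the block only rescales and shifts existing feature channels, it can in principle be inserted after any fusion stage without changing feature shapes, which is consistent with the "plug-and-play" claim in the abstract.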
Journal introduction:
The Journal covers the entire field of infrared physics and technology: theory, experiment, application, devices and instrumentation. "Infrared" is defined as covering the near-, mid- and far-infrared (terahertz) regions, from 0.75 µm (750 nm) to 1 mm (300 GHz). Submissions in the 300 GHz to 100 GHz region may be accepted at the editors' discretion if their content is relevant to shorter wavelengths. Submissions must be primarily concerned with and directly relevant to this spectral region.
Its core topics can be summarized as the generation, propagation and detection of infrared radiation; the associated optics, materials and devices; and its use in all fields of science, industry, engineering and medicine.
Infrared techniques occur in many different fields, notably spectroscopy and interferometry; material characterization and processing; atmospheric physics, astronomy and space research. Scientific aspects include lasers, quantum optics, quantum electronics, image processing and semiconductor physics. Some important applications are medical diagnostics and treatment, industrial inspection and environmental monitoring.