{"title":"sweif:显式和隐式Swin变压器融合红外和可见光图像","authors":"Hongfu Zhang , Xing Wu , Qiuman Zeng , Linrui Shi , Gaochang Wu","doi":"10.1016/j.infrared.2025.106156","DOIUrl":null,"url":null,"abstract":"<div><div>Infrared and visible image fusion aims to integrate thermal target information from infrared images with fine texture details from visible images, enabling comprehensive scene perception in complex environments. Existing explicit methods rely on hand-crafted rules that struggle with cross-modal discrepancies, while implicit end-to-end models require large paired datasets that are often unavailable. To address these challenges, we propose SwinEIF, a <u>Swin</u> Transformer-based fusion framework that synergizes <u>E</u>xplicit and <u>I</u>mplicit paradigms for infrared-visible image <u>F</u>usion. The framework innovatively combines an explicit unimodal feature extraction branch, which learns modality-specific representations through Swin Transformer’s hierarchical self-attention, with an implicit multi-modal feature interaction branch that facilitates adaptive feature fusion via cross-attention between modalities. Additionally, a Discrete Wavelet Transform (DWT)-based fusion decoder is incorporated to fuse features from different frequency bands, and this fusion process uses the content of the source images as fusion weights to generate end-to-end fusion outputs. Leveraging the strengths of the explicit image fusion paradigm, the unimodal feature extraction branch is trained on large-scale, unaligned infrared and visible images, enabling the network to capture diverse patterns and comprehensively extract global information in the first training stage. In the second stage, a smaller, aligned infrared-visible image dataset is then used to fine-tune the multi-modal feature interaction branch and the DWT-based fusion decoder, ensuring high-quality fusion outputs. By fully combining the advantages of both paradigms, SwinEIF demonstrates superior performance across multiple infrared-visible image datasets, outperforming state-of-the-art fusion methods. Experimental results confirm that SwinEIF excels in both subjective visual quality and objective evaluation metrics, showcasing remarkable fusion performance and strong generalization capabilities.</div></div>","PeriodicalId":13549,"journal":{"name":"Infrared Physics & Technology","volume":"151 ","pages":"Article 106156"},"PeriodicalIF":3.4000,"publicationDate":"2025-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SwinEIF: Explicit and implicit Swin Transformer fusion for infrared and visible images\",\"authors\":\"Hongfu Zhang , Xing Wu , Qiuman Zeng , Linrui Shi , Gaochang Wu\",\"doi\":\"10.1016/j.infrared.2025.106156\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Infrared and visible image fusion aims to integrate thermal target information from infrared images with fine texture details from visible images, enabling comprehensive scene perception in complex environments. Existing explicit methods rely on hand-crafted rules that struggle with cross-modal discrepancies, while implicit end-to-end models require large paired datasets that are often unavailable. To address these challenges, we propose SwinEIF, a <u>Swin</u> Transformer-based fusion framework that synergizes <u>E</u>xplicit and <u>I</u>mplicit paradigms for infrared-visible image <u>F</u>usion. 
The framework innovatively combines an explicit unimodal feature extraction branch, which learns modality-specific representations through Swin Transformer’s hierarchical self-attention, with an implicit multi-modal feature interaction branch that facilitates adaptive feature fusion via cross-attention between modalities. Additionally, a Discrete Wavelet Transform (DWT)-based fusion decoder is incorporated to fuse features from different frequency bands, and this fusion process uses the content of the source images as fusion weights to generate end-to-end fusion outputs. Leveraging the strengths of the explicit image fusion paradigm, the unimodal feature extraction branch is trained on large-scale, unaligned infrared and visible images, enabling the network to capture diverse patterns and comprehensively extract global information in the first training stage. In the second stage, a smaller, aligned infrared-visible image dataset is then used to fine-tune the multi-modal feature interaction branch and the DWT-based fusion decoder, ensuring high-quality fusion outputs. By fully combining the advantages of both paradigms, SwinEIF demonstrates superior performance across multiple infrared-visible image datasets, outperforming state-of-the-art fusion methods. Experimental results confirm that SwinEIF excels in both subjective visual quality and objective evaluation metrics, showcasing remarkable fusion performance and strong generalization capabilities.</div></div>\",\"PeriodicalId\":13549,\"journal\":{\"name\":\"Infrared Physics & Technology\",\"volume\":\"151 \",\"pages\":\"Article 106156\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-09-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Infrared Physics & Technology\",\"FirstCategoryId\":\"101\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1350449525004499\",\"RegionNum\":3,\"RegionCategory\":\"物理与天体物理\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"INSTRUMENTS & INSTRUMENTATION\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Infrared Physics & Technology","FirstCategoryId":"101","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1350449525004499","RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"INSTRUMENTS & INSTRUMENTATION","Score":null,"Total":0}
SwinEIF: Explicit and implicit Swin Transformer fusion for infrared and visible images
Infrared and visible image fusion aims to integrate thermal target information from infrared images with fine texture details from visible images, enabling comprehensive scene perception in complex environments. Existing explicit methods rely on hand-crafted rules that struggle with cross-modal discrepancies, while implicit end-to-end models require large paired datasets that are often unavailable. To address these challenges, we propose SwinEIF, a Swin Transformer-based fusion framework that synergizes Explicit and Implicit paradigms for infrared-visible image Fusion. The framework combines an explicit unimodal feature extraction branch, which learns modality-specific representations through the Swin Transformer’s hierarchical self-attention, with an implicit multi-modal feature interaction branch that enables adaptive feature fusion via cross-attention between modalities. Additionally, a Discrete Wavelet Transform (DWT)-based fusion decoder fuses features from different frequency bands, using the content of the source images as fusion weights to generate end-to-end fusion outputs. Leveraging the strengths of the explicit image fusion paradigm, the unimodal feature extraction branch is trained in the first stage on large-scale, unaligned infrared and visible images, enabling the network to capture diverse patterns and comprehensively extract global information. In the second stage, a smaller, aligned infrared-visible image dataset is used to fine-tune the multi-modal feature interaction branch and the DWT-based fusion decoder, ensuring high-quality fusion outputs. By combining the advantages of both paradigms, SwinEIF demonstrates superior performance across multiple infrared-visible image datasets, outperforming state-of-the-art fusion methods. Experimental results confirm that SwinEIF excels in both subjective visual quality and objective evaluation metrics, showcasing remarkable fusion performance and strong generalization capabilities.
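The inter-modality cross-attention the abstract describes can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module name, the dimensions, and the symmetric two-direction design are illustrative assumptions; the point is only that each modality queries the other so the two branches exchange complementary features.

```python
# Minimal sketch of inter-modality cross-attention (illustrative, not the
# SwinEIF code). Infrared tokens attend to visible tokens and vice versa;
# all names and dimensions are assumptions.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 96, num_heads: int = 3):
        super().__init__()
        # One attention block per direction: IR queries VIS, VIS queries IR.
        self.ir_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_from_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, ir_tokens: torch.Tensor, vis_tokens: torch.Tensor):
        # Query with one modality, key/value with the other, so each branch
        # adaptively borrows complementary features, plus a residual path.
        ir_fused, _ = self.ir_from_vis(ir_tokens, vis_tokens, vis_tokens)
        vis_fused, _ = self.vis_from_ir(vis_tokens, ir_tokens, ir_tokens)
        return ir_tokens + ir_fused, vis_tokens + vis_fused

# Example: token sequences from a Swin stage, shape (batch, tokens, channels).
ir = torch.randn(2, 56 * 56, 96)
vis = torch.randn(2, 56 * 56, 96)
fused_ir, fused_vis = CrossModalAttention()(ir, vis)
```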
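The DWT-based fusion idea, fusing per frequency band with weights derived from the source content, can likewise be sketched. The following uses PyWavelets with a hand-rolled energy-based weighting; the wavelet choice, the low-band weighting rule, and the max-magnitude rule for the high bands are assumptions made for illustration, not the paper's decoder.

```python
# Illustrative DWT fusion with content-derived weights (an approximation of
# the idea in the abstract, not the SwinEIF decoder).
import numpy as np
import pywt

def dwt_fuse(ir: np.ndarray, vis: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    # Decompose each source into a low-frequency approximation and
    # three high-frequency detail sub-bands.
    ir_lo, ir_hi = pywt.dwt2(ir, wavelet)
    vis_lo, vis_hi = pywt.dwt2(vis, wavelet)

    # Content-based weights for the low band: proportional to local energy,
    # so the source with stronger content dominates.
    e_ir, e_vis = ir_lo ** 2, vis_lo ** 2
    w = e_ir / (e_ir + e_vis + 1e-8)
    lo = w * ir_lo + (1.0 - w) * vis_lo

    # High bands: keep the coefficient with the larger magnitude, which
    # preserves the sharper detail from either source.
    hi = tuple(np.where(np.abs(a) >= np.abs(b), a, b) for a, b in zip(ir_hi, vis_hi))
    return pywt.idwt2((lo, hi), wavelet)

fused = dwt_fuse(np.random.rand(256, 256), np.random.rand(256, 256))
```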
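The two-stage schedule (pretrain the unimodal extractor on large unaligned data, then fine-tune the interaction branch and decoder on a smaller aligned set) follows a common freeze-then-fine-tune pattern, sketched below. All module and loss names here (unimodal_branch, interaction_branch, dwt_decoder, reconstruction_loss, fusion_loss) are hypothetical placeholders, not the authors' identifiers.

```python
# Sketch of a two-stage schedule as described in the abstract; every
# attribute of `model` is a placeholder assumption.
import torch

def train_two_stage(model, unaligned_loader, aligned_loader, steps=(10_000, 2_000)):
    # Stage 1: pretrain the unimodal extraction branch on large-scale,
    # unaligned infrared and visible images (no pairing required).
    opt = torch.optim.Adam(model.unimodal_branch.parameters(), lr=1e-4)
    for _, (image, modality) in zip(range(steps[0]), unaligned_loader):
        loss = model.reconstruction_loss(image, modality)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: freeze the extractor, fine-tune the multi-modal interaction
    # branch and the DWT decoder on a smaller aligned dataset.
    for p in model.unimodal_branch.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(
        list(model.interaction_branch.parameters())
        + list(model.dwt_decoder.parameters()),
        lr=1e-5,
    )
    for _, (ir, vis) in zip(range(steps[1]), aligned_loader):
        loss = model.fusion_loss(model(ir, vis), ir, vis)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Freezing the pretrained extractor in stage 2 is one plausible reading of "fine-tune the multi-modal feature interaction branch and the DWT-based fusion decoder"; the paper may instead update all parameters at a lower rate.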
About the journal:
The Journal covers the entire field of infrared physics and technology: theory, experiment, application, devices and instrumentation. "Infrared" is defined as covering the near, mid and far infrared (terahertz) regions, from 0.75 μm (750 nm) to 1 mm (300 GHz). Submissions in the 300 GHz to 100 GHz region may be accepted at the editors' discretion if their content is relevant to shorter wavelengths. Submissions must be primarily concerned with and directly relevant to this spectral region.
Its core topics can be summarized as the generation, propagation and detection of infrared radiation; the associated optics, materials and devices; and its use in all fields of science, industry, engineering and medicine.
Infrared techniques occur in many different fields, notably spectroscopy and interferometry; material characterization and processing; atmospheric physics, astronomy and space research. Scientific aspects include lasers, quantum optics, quantum electronics, image processing and semiconductor physics. Some important applications are medical diagnostics and treatment, industrial inspection and environmental monitoring.