A collaborative spatial–frequency learning network for infrared and visible image fusion
Hongbin Yu, Xiangcan Du, Wei Song, Haojie Zhou, Junyi Zhang
Pattern Recognition, Volume 172, Article 112480. Published 2025-09-26. DOI: 10.1016/j.patcog.2025.112480
Citations: 0
Abstract
Most existing deep fusion models operate predominantly in the spatial domain, which limits their ability to preserve texture details effectively. In contrast, methods that incorporate frequency-domain information often suffer from inadequate interaction with spatial-domain features, thereby constraining overall fusion performance. To address these limitations, we propose a Collaborative Spatial–Frequency Learning Network (CSFNet) for infrared and visible image fusion. In the frequency-domain learning branch, we introduce a frequency refinement module based on the wavelet transform to enable cross-band feature interaction and facilitate effective multi-scale feature fusion. In the spatial-domain branch, we embed a learnable low-rank decomposition model that extracts low-rank features from infrared images and sparse detail features from visible images, forming the basis of a dedicated spatial feature extraction module. Additionally, an information aggregation module is designed to learn complementary representations and integrate cross-domain features efficiently. To validate the effectiveness of the proposed approach, we conducted extensive experiments on three publicly available datasets (MSRS, TNO, and RoadScene) and compared CSFNet with sixteen state-of-the-art (SOTA) fusion methods. On the MSRS dataset, CSFNet achieved favorable results, with means and standard deviations of SF = 12.2108 ± 3.8706, VIF = 1.0232 ± 0.1397, Qabf = 0.7112 ± 0.0397, SSIM = 0.6909 ± 0.0859, PSNR = 17.6517 ± 3.8767, and AG = 4.0243 ± 1.5465. The minimum performance improvement over SOTA methods was 1.64%, while the maximum gain reached 108.82%. Furthermore, CSFNet demonstrated superior performance on a downstream semantic segmentation task.
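The frequency-domain branch builds on a wavelet decomposition of the inputs. As a rough, non-learned illustration of the band split involved, the Python sketch below fuses two grayscale images by averaging the low-frequency sub-band and keeping the larger-magnitude coefficient in each high-frequency sub-band; the paper's actual frequency refinement module learns this cross-band interaction, so the mixing rules here are hypothetical stand-ins.

```python
# A minimal, non-learned sketch of the wavelet band split that a
# frequency-refinement module could build on. The fixed averaging and
# max-magnitude rules below are illustrative stand-ins for CSFNet's
# learned cross-band interaction.
import numpy as np
import pywt

def wavelet_band_mix(ir: np.ndarray, vis: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """Fuse two grayscale images by mixing their wavelet sub-bands."""
    ll_i, (lh_i, hl_i, hh_i) = pywt.dwt2(ir, wavelet)
    ll_v, (lh_v, hl_v, hh_v) = pywt.dwt2(vis, wavelet)
    # Low-frequency band: average the coarse structure of both inputs.
    ll = 0.5 * (ll_i + ll_v)
    # High-frequency bands: keep the larger-magnitude coefficient, a common
    # non-learned proxy for "preserve the sharper detail".
    pick = lambda a, b: np.where(np.abs(a) >= np.abs(b), a, b)
    bands = (pick(lh_i, lh_v), pick(hl_i, hl_v), pick(hh_i, hh_v))
    return pywt.idwt2((ll, bands), wavelet)
```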
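Similarly, the spatial-domain branch rests on a low-rank-plus-sparse modeling assumption: infrared images contribute a smooth, low-rank base layer, while visible images contribute sparse detail. A classical, non-learned sketch of that split (truncated SVD for the base, soft-thresholding for the sparse residual) follows; CSFNet's decomposition is learnable, so this is only meant to make the assumption concrete.

```python
# A non-learned sketch of low-rank + sparse decomposition. CSFNet embeds a
# *learnable* decomposition, so this classical single-pass variant only
# illustrates the modeling assumption (rank-r base + sparse detail).
import numpy as np

def lowrank_sparse_split(x: np.ndarray, rank: int = 4, thresh: float = 0.05):
    """Split a 2-D image into a rank-`rank` base layer and a sparse residual."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]  # truncated-SVD base
    residual = x - low_rank
    # Soft-threshold the residual so only salient details survive as "sparse".
    sparse = np.sign(residual) * np.maximum(np.abs(residual) - thresh, 0.0)
    return low_rank, sparse
```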
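For reference, two of the reported metrics, spatial frequency (SF) and average gradient (AG), have simple closed forms. The implementations below follow one common convention from the fusion literature; exact normalizations vary slightly between papers.

```python
# Reference implementations of SF (spatial frequency) and AG (average
# gradient) for a grayscale image, following one common convention.
import numpy as np

def spatial_frequency(img: np.ndarray) -> float:
    # RMS of differences between adjacent columns and adjacent rows.
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))
    return float(np.sqrt(rf ** 2 + cf ** 2))

def average_gradient(img: np.ndarray) -> float:
    dx = np.diff(img, axis=1)[:-1, :]  # horizontal gradient, cropped to align
    dy = np.diff(img, axis=0)[:, :-1]  # vertical gradient, cropped to align
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))
```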
Journal overview:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.