SMDFusion: A Self-Supervised Fusion for Infrared and Visible Images via Cross-Modal Random Noise Masked Encoding and Difference Perception

Impact Factor: 10.9 · CAS Zone 2 (Computer Science) · JCR Q1 (Engineering, Electrical & Electronic)
Mingchuan Tan;Rencan Nie;Jinde Cao;Ying Zhang
DOI: 10.1109/TCE.2025.3565680
Journal: IEEE Transactions on Consumer Electronics, vol. 71, no. 2, pp. 2579-2591
Published: 2025-04-29
Article page: https://ieeexplore.ieee.org/document/10979991/
Citations: 0

Abstract

Infrared and visible image fusion (IVIF) aims to merge images of the same scene from both modalities into a single image, enabling comprehensive information display and better support for visual computing tasks. Nevertheless, existing methods often overlook pixel-level relationships and struggle to effectively eliminate redundant information. To this end, we propose SMDFusion, a novel framework for fusing infrared and visible images via cross-modal noise-masked encoding and cross-modal difference-perception information coupling. The framework consists of a self-supervised learning network (SSLN) and an unsupervised fusion network (UFN). In the SSLN, the random noise masked encoder learns pixel-level relationships by employing a grid structure for multi-scale feature mapping that facilitates information exchange among different scales; the network is optimized with a self-supervision strategy for better representation learning. In the UFN, symmetrical grid structures and multi-scale attention mechanisms integrate intra-modal features, while the cross-modal difference perception (CDP) module eliminates redundant information between modalities and conditionally captures complementary perception. The fused image is synthesized by computing modality-specific contribution estimates. Qualitative and quantitative experimental results demonstrate that SMDFusion outperforms representative methods in multi-modal information fusion as well as in supporting downstream tasks. The code is available at: https://github.com/rcnie/IVIF-SMDFusion.
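The two core ideas in the abstract, corrupting pixels with random noise as a masked-encoding pretext task and fusing modalities by per-pixel contribution weights, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the mask ratio, noise scale, and magnitude-based weighting below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def random_noise_mask(img, mask_ratio=0.5, noise_std=0.1, rng=None):
    """Replace a random subset of pixels with Gaussian noise.

    Returns the corrupted image and the boolean mask of corrupted pixels;
    a masked encoder would be trained to reconstruct `img` from the output.
    """
    rng = np.random.default_rng(rng)
    mask = rng.random(img.shape) < mask_ratio
    noise = rng.normal(0.0, noise_std, img.shape)
    return np.where(mask, noise, img), mask

def contribution_fusion(ir_feat, vis_feat):
    """Fuse two modality features as a per-pixel convex combination.

    Weights here are a simple softmax over feature magnitudes; the paper's
    modality-specific contribution estimation is learned, not hand-crafted.
    """
    w_ir = np.exp(np.abs(ir_feat))
    w_vis = np.exp(np.abs(vis_feat))
    return (w_ir * ir_feat + w_vis * vis_feat) / (w_ir + w_vis)

# Toy example on random 4x4 "images" in [0, 1)
ir = np.random.default_rng(1).random((4, 4))
vis = np.random.default_rng(2).random((4, 4))
masked, m = random_noise_mask(ir, mask_ratio=0.25, rng=0)
fused = contribution_fusion(ir, vis)
```

Because the weights are positive and normalized, each fused pixel lies between the corresponding infrared and visible values, which is one simple way to read "contribution estimation" as a pixel-wise weighting scheme.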
Journal metrics: CiteScore 7.70 · Self-citation rate 9.30% · Annual articles 59 · Review time 3.3 months
Journal scope: The main focus of the IEEE Transactions on Consumer Electronics is the engineering and research aspects of the theory, design, construction, manufacture, and end use of mass-market electronics, systems, software, and services for consumers.