Self-distillation guided Semantic Knowledge Feedback network for infrared–visible image fusion

IF 4.2 · Zone 3 (Computer Science) · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing · Pub Date: 2025-05-03 · DOI: 10.1016/j.imavis.2025.105566

Wei Zhou, Yingyuan Wang, Lina Zuo, Dan Ma, Yugen Yi

Image and Vision Computing, volume 159, Article 105566. Journal Article. https://www.sciencedirect.com/science/article/pii/S0262885625001544

Citations: 0

Abstract

Infrared–visible image fusion combines complementary information from both modalities to enhance visual quality and support downstream tasks. However, existing methods typically enhance semantic information by designing fusion functions for the source images and combining them with downstream networks, overlooking the optimization and guidance of the fused image itself. This neglect weakens the semantic knowledge within the fused image, limiting its alignment with task objectives and reducing accuracy in downstream tasks. To overcome these limitations, we propose the self-distillation guided Semantic Knowledge Feedback (SKFFusion) network, which extracts semantic knowledge from the fused image and feeds it back to iteratively optimize the fusion process, addressing the lack of semantic guidance. Specifically, we introduce shallow-to-deep feature fusion modules, namely Shallow Texture Fusion (STF) and Deep Semantic Fusion (DSF), to integrate fine-grained details and high-level semantics. The STF uses channel and spatial attention mechanisms to aggregate detailed multi-modal information, while the DSF leverages a Mamba structure to capture long-range dependencies, enabling deeper cross-modal semantic fusion. Additionally, we design a CNN-Transformer-based Knowledge Feedback Network (KFN) to extract local detail features and capture global dependencies. A Semantic Attention Guidance (SAG) component further refines the fused image's semantic representation, aligning it with task objectives. Finally, a distillation loss makes training more robust and improves image quality. Experimental results show that SKFFusion outperforms existing methods in visual quality and vision task performance, particularly under challenging conditions such as low light and fog. Our code is available at https://github.com/yyzzttkkjj/SKFFusion.
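To make the channel and spatial attention idea attributed to STF more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation (their repository is the reference for that). The names `channel_weights`, `spatial_weights`, `attend`, and `stf_fuse_sketch` are hypothetical, and fixed sigmoid-of-mean gates stand in for the learned layers a real attention module would use. Feature maps are assumed to be nested lists shaped channels x height x width.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_weights(feat):
    """One gate per channel: sigmoid of that channel's global average.

    Assumption: a real module would pass pooled statistics through learned layers.
    """
    weights = []
    for channel in feat:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values)))
    return weights


def spatial_weights(feat):
    """One gate per pixel: sigmoid of the mean across channels at that pixel."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[sigmoid(sum(feat[k][i][j] for k in range(c)) / c)
             for j in range(w)] for i in range(h)]


def attend(feat):
    """Apply the channel gates first, then the spatial gates."""
    cw = channel_weights(feat)
    scaled = [[[cw[k] * v for v in row] for row in channel]
              for k, channel in enumerate(feat)]
    sw = spatial_weights(scaled)
    return [[[sw[i][j] * v for j, v in enumerate(row)]
             for i, row in enumerate(channel)] for channel in scaled]


def stf_fuse_sketch(ir_feat, vis_feat):
    """Hypothetical fusion rule: element-wise sum of the two attended maps."""
    a, b = attend(ir_feat), attend(vis_feat)
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


# Toy inputs: 2 channels of 2 x 2 features per modality.
ir_feat = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
vis_feat = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
fused = stf_fuse_sketch(ir_feat, vis_feat)
```

The element-wise sum is only one possible way to combine the attended maps; the abstract does not specify the aggregation rule, so treat it as a placeholder.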

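Source journal

Image and Vision Computing (Engineering & Technology, Engineering: Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months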

Journal description: Image and Vision Computing aims primarily to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.