{"title":"CLIP-Based Natural Language-Guided Low-Redundancy Fusion of Infrared and Visible Images","authors":"Jundong Zhang;Kangjian He;Dan Xu;Hongzhen Shi","doi":"10.1109/TCE.2025.3526792","DOIUrl":null,"url":null,"abstract":"The objective of infrared and visible image fusion is to produce a fused image that encompasses significant objects and intricate textures. However, existing methods frequently prioritize the extraction of complementary information, often overlooking the detrimental effects of redundant features. Moreover, due to the absence of authentic fused images, traditional mathematically defined loss functions face challenges in accurately modeling the characteristics of fused images. To address these challenges, this paper utilizes CLIP to design a natural language-guided, low-redundancy feature infrared and visible image fusion network. On one hand, we designed a Partial Feature Extraction(PFE) block and a Spatial-Channel Reconstruction Screening(SCRS) block to effectively reduce redundant features and enhance the focus on critical features. Additionally, we leveraged the CLIP model to bridge the gap between images and natural language, innovatively crafting a language-driven loss function to guide the fusion process through linguistic expressions. Extensive experiments conducted on multiple public datasets demonstrate that this method outperforms existing advanced techniques in both visual quality and quantitative assessment. Moreover, it achieves superior detection accuracy compared to current methods, reaching an advanced level of performance. The source code will be released at <uri>https://github.com/VCMHE/CNLFusion</uri>.","PeriodicalId":13208,"journal":{"name":"IEEE Transactions on Consumer Electronics","volume":"71 1","pages":"931-944"},"PeriodicalIF":4.3000,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Consumer Electronics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10829832/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
The objective of infrared and visible image fusion is to produce a fused image that preserves both significant objects and intricate textures. However, existing methods frequently prioritize the extraction of complementary information while overlooking the detrimental effects of redundant features. Moreover, because authentic ground-truth fused images do not exist, traditional mathematically defined loss functions struggle to accurately model the characteristics of fused images. To address these challenges, this paper uses CLIP to design a natural language-guided, low-redundancy infrared and visible image fusion network. On the one hand, we design a Partial Feature Extraction (PFE) block and a Spatial-Channel Reconstruction Screening (SCRS) block to reduce redundant features and sharpen the focus on critical ones. On the other hand, we leverage the CLIP model to bridge the gap between images and natural language, crafting a language-driven loss function that guides the fusion process through linguistic expressions. Extensive experiments on multiple public datasets demonstrate that the method outperforms existing advanced techniques in both visual quality and quantitative assessment, and it achieves superior detection accuracy compared to current methods. The source code will be released at https://github.com/VCMHE/CNLFusion.
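As a rough illustration of the language-driven loss idea described above, the sketch below uses OpenAI's public CLIP package to score a fused image against a fixed text prompt and penalize their embedding distance. The prompt wording, the cosine-distance form, and all identifiers here are assumptions chosen for illustration; they are not the paper's actual loss formulation.

```python
# Minimal sketch of a CLIP-based language-driven loss (assumed formulation,
# not the paper's exact definition). Requires: pip install torch and
# pip install git+https://github.com/openai/CLIP.git
import torch
import clip


class LanguageDrivenLoss(torch.nn.Module):
    def __init__(self,
                 prompt="a clear fused image with salient targets and rich textures",
                 device="cuda" if torch.cuda.is_available() else "cpu"):
        super().__init__()
        # Load a frozen CLIP model; its weights are never updated, it only
        # provides a shared image-text embedding space to measure the loss in.
        self.model, _ = clip.load("ViT-B/32", device=device)
        for p in self.model.parameters():
            p.requires_grad_(False)
        # Encode the text prompt once and cache its normalized embedding.
        with torch.no_grad():
            tokens = clip.tokenize([prompt]).to(device)
            text_feat = self.model.encode_text(tokens)
            self.text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    def forward(self, fused):
        # `fused` is assumed to be a batch of 3-channel images already resized
        # to 224x224 and normalized with CLIP's statistics (preprocessing omitted).
        img_feat = self.model.encode_image(fused)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        # Cosine distance: pushing this down pulls the fused image's embedding
        # toward the linguistic description of a desirable fusion result.
        return (1.0 - (img_feat * self.text_feat).sum(dim=-1)).mean()
```

Because the CLIP weights are frozen, gradients flow only into the fused image (and hence into the fusion network that produced it); in practice such a term would be weighted and combined with conventional intensity/gradient losses.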
Journal Introduction:
The main focus of the IEEE Transactions on Consumer Electronics is the engineering and research aspects of the theory, design, construction, manufacture, or end use of mass-market electronics, systems, software, and services for consumers.