An Efficient Image Fusion Network Exploiting Unifying Language and Mask Guidance.

IF 20.8 | Zone 1, Computer Science | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Pattern Analysis and Machine Intelligence | Pub Date: 2025-07-23 | DOI: 10.1109/tpami.2025.3591930

Zi-Han Cao, Yu-Jie Liang, Liang-Jian Deng, Gemine Vivone

Citations: 0

Abstract

Image fusion aims to merge image pairs collected by different sensors over the same scene while preserving their distinct features. Recent works have often focused on designing various image fusion losses, developing different network architectures, and leveraging downstream tasks (e.g., object detection) for image fusion. However, few studies have explored how language and semantic masks can serve as guidance to aid image fusion. In this paper, we investigate how the combination of language and masks can guide image fusion tasks, discarding previous complex frameworks that rely on downstream tasks, GAN-based cycle training, diffusion models, or deep image priors. Additionally, we exploit a recurrent neural network-like architecture to build a lightweight network that avoids the quadratic cost of traditional attention mechanisms. To adapt the receptance weighted key value (RWKV) model to the image modality, we modify it into a bidirectional version using an efficient scanning strategy (ESS). To guide image fusion with language and mask features, we introduce a multi-modal fusion module (MFM) to facilitate information exchange. Comprehensive experiments show that the proposed framework achieves state-of-the-art results in various image fusion tasks (i.e., visible-infrared image fusion, multi-focus image fusion, multi-exposure image fusion, medical image fusion, hyperspectral and multispectral image fusion, and pansharpening). Code will be available at https://github.com/294coder/RWKVFusion.
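
To make the "recurrent, bidirectional, linear-cost" idea in the abstract more concrete, here is a minimal NumPy sketch. It is an illustration only and not the authors' method: the recurrence is a plain exponential-decay scan rather than the actual RWKV update, the row-major flattening stands in for the paper's ESS (whose details are not given here), and the names `decayed_scan`, `bidirectional_scan`, and `decay` are hypothetical. The point it shows is that each token can gather context from both directions in one forward and one backward pass, so the cost grows linearly with the number of pixels instead of quadratically as in pairwise attention.

```python
import numpy as np


def decayed_scan(values, decay):
    """Causal linear-time scan: state[t] = decay * state[t-1] + values[t].

    NOTE: a simplified stand-in for an RNN-like token mixer, not the RWKV formula.
    """
    state = np.zeros(values.shape[1], dtype=float)
    out = np.empty(values.shape, dtype=float)
    for t in range(values.shape[0]):
        state = decay * state + values[t]
        out[t] = state
    return out


def bidirectional_scan(feature_map, decay=0.9):
    """Mix an (H, W, C) feature map with one forward and one backward scan.

    Assumption: tokens are ordered by simple row-major flattening; the paper's
    efficient scanning strategy (ESS) may order them differently.
    """
    feature_map = np.asarray(feature_map, dtype=float)
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)

    forward = decayed_scan(tokens, decay)               # context from earlier tokens
    backward = decayed_scan(tokens[::-1], decay)[::-1]  # context from later tokens

    mixed = 0.5 * (forward + backward)                  # each pass costs O(H * W)
    return mixed.reshape(h, w, c)


if __name__ == "__main__":
    # A single impulse spreads symmetrically to both sides of the sequence.
    x = np.zeros((1, 5, 1))
    x[0, 2, 0] = 1.0
    print(bidirectional_scan(x, decay=0.5)[0, :, 0])
```

With a single nonzero token in the middle of the sequence, the output decays symmetrically on both sides, which a purely causal (one-directional) scan would not do.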

Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 28.40

Self-citation rate: 3.00%

Articles published: 885

Review time: 8.5 months

Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.