Towards text-refereed multi-modal image fusion by cross-modality interaction

IF 3.4 | Zone 2, Engineering Technology | Q2, ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing | Pub Date: 2025-05-06 | DOI: 10.1016/j.sigpro.2025.110073

Qilei Li, Wenhao Song, Mingliang Gao, Wenzhe Zhai, Qiang Zhou, Zhao Huang

Signal Processing, volume 237, Article 110073. Full text: https://www.sciencedirect.com/science/article/pii/S0165168425001872

Citations: 0

Abstract

Multi-modal image fusion aims to generate a fused image that combines the advantages of source images from different modalities. The fused image can significantly facilitate high-level vision tasks, e.g., image segmentation and object detection. However, most existing fusion methods focus on preserving the structure and detailed representation of the fused images while failing to integrate the high-level semantic information in the source images. To address this problem, we propose a text-guided multi-modal image fusion framework, termed Cross-Modality Interaction (CMI)-Fusion. The proposed model leverages the robust capabilities of a large-scale foundation model, i.e., Contrastive Language–Image Pre-training (CLIP), to achieve efficient interaction between image details and text prompts. Specifically, a Dual Attention Feature Extraction (DAFE) module is designed to extract representative visual and semantic features. Moreover, a cross-modality Image-Text Interaction (ITI) module is designed to achieve dynamic interaction between the image features and the corresponding text features. Extensive experiments on various multi-modal datasets demonstrate that the proposed CMI-Fusion retains image structural details and semantic content better than state-of-the-art methods. The code is available at https://github.com/songwenhao123/CMI-Fusion.
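
The abstract describes the image-text interaction only at a high level. As a rough illustration of how image features can be conditioned on text features, the sketch below uses plain scaled dot-product cross-attention, with image tokens as queries and text tokens as keys and values. This is an assumption made for illustration, not the authors' ITI module: the function name `cross_attention`, the residual connection, and the toy two-dimensional vectors are all hypothetical, and a real model would use CLIP embeddings with learned projection layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, text_tokens):
    """Let each image token (query) attend over all text tokens (keys/values).

    Hypothetical helper for illustration only; no learned projections are used.
    Returns the text-conditioned image tokens and the attention weights.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for query in image_tokens:
        # Scaled dot-product similarity between this image token and each text token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in text_tokens]
        attn = softmax(scores)
        # Attention-weighted sum of the text tokens.
        context = [sum(a * tok[d] for a, tok in zip(attn, text_tokens)) for d in range(dim)]
        # Residual connection keeps the original image feature in the output.
        fused.append([q + c for q, c in zip(query, context)])
        weights.append(attn)
    return fused, weights


# Toy inputs: two image tokens and three text tokens, all two-dimensional.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, attn = cross_attention(image_tokens, text_tokens)
```

In this toy setup each image token places its largest attention weight on the text token that points in the same direction, which is the basic mechanism by which a text prompt can steer which image features are emphasized.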


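Source journal

Signal Processing (Engineering Technology: Engineering, Electrical & Electronic)

CiteScore: 9.20

Self-citation rate: 9.10%

Articles published: 309

Review time: 41 days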

About the journal: Signal Processing incorporates all aspects of the theory and practice of signal processing. It features original research work, tutorial and review articles, and accounts of practical developments. It is intended for the rapid dissemination of knowledge and experience to engineers and scientists working in the research, development or practical application of signal processing. Subject areas covered by the journal include: Signal Theory; Stochastic Processes; Detection and Estimation; Spectral Analysis; Filtering; Signal Processing Systems; Software Developments; Image Processing; Pattern Recognition; Optical Signal Processing; Digital Signal Processing; Multi-dimensional Signal Processing; Communication Signal Processing; Biomedical Signal Processing; Geophysical and Astrophysical Signal Processing; Earth Resources Signal Processing; Acoustic and Vibration Signal Processing; Data Processing; Remote Sensing; Signal Processing Technology; Radar Signal Processing; Sonar Signal Processing; Industrial Applications; New Applications.