Qilei Li , Wenhao Song , Mingliang Gao , Wenzhe Zhai , Qiang Zhou , Zhao Huang
Signal Processing, Volume 237, Article 110073. Published 2025-05-06. DOI: 10.1016/j.sigpro.2025.110073. Impact factor: 3.4 (JCR Q2, Engineering, Electrical & Electronic). Article page: https://www.sciencedirect.com/science/article/pii/S0165168425001872
Towards text-refereed multi-modal image fusion by cross-modality interaction
Multi-modal image fusion aims to generate a fused image that combines the advantages of source images from different modalities. Such a fused image can significantly facilitate high-level vision tasks, e.g., image segmentation and object detection. However, most existing fusion methods focus on preserving the structure and detail of the fused image while failing to integrate the high-level semantic information in the source images. To address this problem, we propose a text-guided multi-modal image fusion framework, termed Cross-Modality Interaction (CMI)-Fusion. The proposed model leverages a large-scale foundation model, i.e., Contrastive Language–Image Pre-training (CLIP), to achieve efficient interaction between image details and text prompts. Specifically, a Dual Attention Feature Extraction (DAFE) module is designed to extract representative visual and semantic features. Moreover, a cross-modality Image-Text Interaction (ITI) module is designed to achieve dynamic interaction between the image features and the corresponding text features. Extensive experiments on various multi-modal datasets demonstrate that the proposed CMI-Fusion retains image structural details and semantic content better than state-of-the-art methods. The code is available at https://github.com/songwenhao123/CMI-Fusion.
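The abstract does not specify how the image and text features interact, so the following is only a generic illustration of the idea, not the authors' ITI module. It sketches single-head scaled dot-product cross-attention in which image tokens act as queries and text tokens act as keys and values. Every name here (`cross_attention`, `w_q`, `w_k`, `w_v`) and all dimensions are assumptions, and random arrays stand in for features that a real system would obtain from image and text encoders such as CLIP.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Let each image token attend over all text tokens.

    image_tokens: (N, d_img) array of image features (queries).
    text_tokens:  (M, d_txt) array of text features (keys and values).
    Returns the attended features (N, d) and the attention weights (N, M).
    """
    q = image_tokens @ w_q                      # (N, d)
    k = text_tokens @ w_k                       # (M, d)
    v = text_tokens @ w_v                       # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (N, M) scaled similarities
    weights = softmax(scores, axis=-1)          # each row sums to 1
    attended = weights @ v                      # (N, d) text-informed features
    return attended, weights


# Hypothetical sizes: 16 image tokens, 4 text tokens (placeholders, not from the paper).
rng = np.random.default_rng(0)
n_img, n_txt, d_img, d_txt, d = 16, 4, 32, 64, 32
image_tokens = rng.standard_normal((n_img, d_img))
text_tokens = rng.standard_normal((n_txt, d_txt))

# Random projections stand in for learned weights.
w_q = rng.standard_normal((d_img, d)) / np.sqrt(d_img)
w_k = rng.standard_normal((d_txt, d)) / np.sqrt(d_txt)
w_v = rng.standard_normal((d_txt, d)) / np.sqrt(d_txt)

attended, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)

# Residual connection: keep the image features and add the text-conditioned
# update (valid here because d equals d_img).
fused = image_tokens + attended
```

Using the image tokens as queries keeps the output on the image's spatial grid, which is one plausible way for a text prompt to modulate features before a fused image is decoded. The released repository is the place to check the actual design.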
About the journal:
Signal Processing incorporates all aspects of the theory and practice of signal processing. It features original research work, tutorial and review articles, and accounts of practical developments. It is intended for the rapid dissemination of knowledge and experience to engineers and scientists working in the research, development, or practical application of signal processing.
Subject areas covered by the journal include: Signal Theory; Stochastic Processes; Detection and Estimation; Spectral Analysis; Filtering; Signal Processing Systems; Software Developments; Image Processing; Pattern Recognition; Optical Signal Processing; Digital Signal Processing; Multi-dimensional Signal Processing; Communication Signal Processing; Biomedical Signal Processing; Geophysical and Astrophysical Signal Processing; Earth Resources Signal Processing; Acoustic and Vibration Signal Processing; Data Processing; Remote Sensing; Signal Processing Technology; Radar Signal Processing; Sonar Signal Processing; Industrial Applications; New Applications.