MFA-NRM: A novel framework for multimodal fusion and semantic alignment in visual neural decoding

Wei Huang, Hengjiang Li, Fan Qin, Jingpeng Li, Sizhuo Wang, Pengfei Yang, Luan Zhang, Yunshuang Fan, Jing Guo, Kaiwen Cheng, Huafu Chen

Information Fusion, Volume 127, Article 103717 (published 2025-09-12)
DOI: 10.1016/j.inffus.2025.103717
URL: https://www.sciencedirect.com/science/article/pii/S1566253525007766
Citations: 0
Abstract
Integrating multimodal semantic features, such as images and text, to enhance visual neural representations has proven to be an effective strategy in brain visual decoding. However, previous studies have either focused solely on unimodal enhancement techniques or have inadequately addressed the alignment ambiguity between modalities, leading to underutilization of the complementary benefits of multimodal features or a reduction in the semantic richness of the resulting neural representations. To address these limitations, we propose a Multimodal Fusion Alignment Neural Representation Model (MFA-NRM), which enhances visual neural decoding by integrating multimodal semantic features from images and text. The MFA-NRM incorporates a fusion module that uses a Variational Autoencoder (VAE) and a self-attention mechanism to integrate multimodal features into a unified latent space, thereby facilitating robust semantic alignment with neural activity. Additionally, we introduce prompt techniques that adapt neural representations to individual differences, improving cross-subject generalization. Our approach also leverages the semantic knowledge of ten large pre-trained models to further enhance performance. Experimental results on the Natural Scenes Dataset (NSD) show that, compared to unimodal alignment methods, our method improves performance on recognition tasks by 18.8% and on classification tasks by 4.30%; compared to other multimodal alignment methods without the fusion module, it improves recognition by 33.59% and classification by 4.26%. These findings indicate that the MFA-NRM effectively resolves alignment ambiguity and enables richer semantic extraction from brain responses to multimodal visual stimuli, offering new perspectives for visual neural decoding.
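To make the fusion idea concrete, the following is a minimal PyTorch sketch of how a VAE-plus-self-attention fusion module of this kind might be structured. All class names, feature dimensions, and the cosine-similarity alignment objective are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a VAE + self-attention fusion module in the spirit
# of MFA-NRM. Dimensions and the alignment loss are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalFusionVAE(nn.Module):
    def __init__(self, img_dim=768, txt_dim=512, d_model=256, latent_dim=128):
        super().__init__()
        # Project each modality's pre-trained features into a shared token space.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # Self-attention lets the image and text tokens exchange information.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # VAE head: map the pooled, fused representation to a Gaussian latent.
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, img_dim), txt_feat: (B, txt_dim)
        tokens = torch.stack(
            [self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=1
        )  # (B, 2, d_model)
        fused, _ = self.attn(tokens, tokens, tokens)
        pooled = fused.mean(dim=1)  # (B, d_model)
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        # Reparameterization trick: sample a latent while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

# Toy usage: align the fused latent with a stand-in for encoded neural
# activity via cosine similarity, one plausible alignment objective.
fusion = MultimodalFusionVAE()
img, txt = torch.randn(8, 768), torch.randn(8, 512)
z, mu, logvar = fusion(img, txt)
brain = torch.randn(8, 128)  # placeholder for an fMRI embedding
align = nn.functional.cosine_similarity(z, brain).mean()
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = kl - align  # encourage alignment while regularizing the latent
```

In a design like this, the self-attention step lets the two modality tokens condition on each other before pooling into a single fused vector, while the KL term keeps the latent space well-behaved for subsequent alignment with brain responses.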
About the Journal
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion and fosters collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.