MFA-NRM: A novel framework for multimodal fusion and semantic alignment in visual neural decoding

IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Wei Huang, Hengjiang Li, Fan Qin, Jingpeng Li, Sizhuo Wang, Pengfei Yang, Luan Zhang, Yunshuang Fan, Jing Guo, Kaiwen Cheng, Huafu Chen
{"title":"MFA-NRM:视觉神经解码中多模态融合和语义对齐的新框架","authors":"Wei Huang ,&nbsp;Hengjiang Li ,&nbsp;Fan Qin ,&nbsp;Jingpeng Li ,&nbsp;Sizhuo Wang ,&nbsp;Pengfei Yang ,&nbsp;Luan Zhang ,&nbsp;Yunshuang Fan ,&nbsp;Jing Guo ,&nbsp;Kaiwen Cheng ,&nbsp;Huafu Chen","doi":"10.1016/j.inffus.2025.103717","DOIUrl":null,"url":null,"abstract":"<div><div>Integrating multimodal semantic features, such as images and text, to enhance visual neural representations has proven to be an effective strategy in brain visual decoding. However, previous studies have either focused solely on unimodal enhancement techniques or have inadequately addressed the alignment ambiguity between different modalities, leading to an underutilization of the complementary benefits of multimodal features or a reduction in the semantic richness of the resulting neural representations. To address these limitations, we propose a Multimodal Fusion Alignment Neural Representation Model (MFA-NRM), which enhances visual neural decoding by integrating multimodal semantic features from images and text. The MFA-NRM incorporates a fusion module that utilizes a Variational Autoencoder (VAE) and a self-attention mechanism to integrate multimodal features into a unified latent space, thereby facilitating robust semantic alignment with neural activity. Additionally, we introduce prompt techniques that adapt neural representations to individual differences, improving cross-subject generalization. Our approach also leverages the semantic knowledge from ten large pre-trained models to further enhance performance. Experimental results on the Natural Scenes Dataset (NSD) show that, compared to unimodal alignment methods, our method improves recognition tasks by 18.8 % and classification tasks by 4.30 %, compared to other multimodal alignment methods without the fusion module, our approach improves recognition tasks by 33.59 % and classification tasks by 4.26 %. These findings indicate that the MFA-NRM effectively resolves the problem of alignment ambiguity and enables richer semantic extraction from brain responses to multimodal visual stimuli, offering new perspectives for visual neural decoding.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103717"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MFA-NRM: A novel framework for multimodal fusion and semantic alignment in visual neural decoding\",\"authors\":\"Wei Huang ,&nbsp;Hengjiang Li ,&nbsp;Fan Qin ,&nbsp;Jingpeng Li ,&nbsp;Sizhuo Wang ,&nbsp;Pengfei Yang ,&nbsp;Luan Zhang ,&nbsp;Yunshuang Fan ,&nbsp;Jing Guo ,&nbsp;Kaiwen Cheng ,&nbsp;Huafu Chen\",\"doi\":\"10.1016/j.inffus.2025.103717\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Integrating multimodal semantic features, such as images and text, to enhance visual neural representations has proven to be an effective strategy in brain visual decoding. However, previous studies have either focused solely on unimodal enhancement techniques or have inadequately addressed the alignment ambiguity between different modalities, leading to an underutilization of the complementary benefits of multimodal features or a reduction in the semantic richness of the resulting neural representations. 
To address these limitations, we propose a Multimodal Fusion Alignment Neural Representation Model (MFA-NRM), which enhances visual neural decoding by integrating multimodal semantic features from images and text. The MFA-NRM incorporates a fusion module that utilizes a Variational Autoencoder (VAE) and a self-attention mechanism to integrate multimodal features into a unified latent space, thereby facilitating robust semantic alignment with neural activity. Additionally, we introduce prompt techniques that adapt neural representations to individual differences, improving cross-subject generalization. Our approach also leverages the semantic knowledge from ten large pre-trained models to further enhance performance. Experimental results on the Natural Scenes Dataset (NSD) show that, compared to unimodal alignment methods, our method improves recognition tasks by 18.8 % and classification tasks by 4.30 %, compared to other multimodal alignment methods without the fusion module, our approach improves recognition tasks by 33.59 % and classification tasks by 4.26 %. These findings indicate that the MFA-NRM effectively resolves the problem of alignment ambiguity and enables richer semantic extraction from brain responses to multimodal visual stimuli, offering new perspectives for visual neural decoding.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103717\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525007766\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525007766","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Integrating multimodal semantic features, such as images and text, to enhance visual neural representations has proven to be an effective strategy in brain visual decoding. However, previous studies have either focused solely on unimodal enhancement techniques or have inadequately addressed the alignment ambiguity between different modalities, leading to underutilization of the complementary benefits of multimodal features or a reduction in the semantic richness of the resulting neural representations. To address these limitations, we propose a Multimodal Fusion Alignment Neural Representation Model (MFA-NRM), which enhances visual neural decoding by integrating multimodal semantic features from images and text. The MFA-NRM incorporates a fusion module that uses a Variational Autoencoder (VAE) and a self-attention mechanism to integrate multimodal features into a unified latent space, thereby enabling robust semantic alignment with neural activity. Additionally, we introduce prompt techniques that adapt neural representations to individual differences, improving cross-subject generalization. Our approach also leverages the semantic knowledge of ten large pre-trained models to further enhance performance. Experimental results on the Natural Scenes Dataset (NSD) show that, compared to unimodal alignment methods, our method improves performance on recognition tasks by 18.8% and on classification tasks by 4.30%; compared to other multimodal alignment methods without the fusion module, it improves recognition tasks by 33.59% and classification tasks by 4.26%. These findings indicate that the MFA-NRM effectively resolves the problem of alignment ambiguity and enables richer semantic extraction from brain responses to multimodal visual stimuli, offering new perspectives for visual neural decoding.
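The abstract describes the fusion module only at a high level. Below is a minimal sketch, using standard PyTorch components, of how a VAE-plus-self-attention fusion over paired image and text embeddings could be organized. All module names, dimensions, and the mean-pooling choice are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a VAE + self-attention fusion module in the spirit
# of MFA-NRM. Names, dimensions, and loss choices are assumptions for
# illustration only; they are not taken from the paper's code.
import torch
import torch.nn as nn

class FusionVAE(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=256, n_heads=4):
        super().__init__()
        # Self-attention lets the image and text tokens exchange information
        # before being compressed into a shared latent space.
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, feat_dim)

    def forward(self, img_feat, txt_feat):
        # Stack the two modality embeddings as a 2-token sequence: (B, 2, D).
        tokens = torch.stack([img_feat, txt_feat], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        pooled = fused.mean(dim=1)  # (B, D) joint summary of both modalities
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(z)  # reconstruction term regularizes the latent
        return z, mu, logvar, recon

model = FusionVAE()
img = torch.randn(8, 512)  # e.g. image embeddings from a pre-trained model (assumed)
txt = torch.randn(8, 512)  # e.g. text embeddings from a pre-trained model (assumed)
z, mu, logvar, recon = model(img, txt)
```

In the paper's pipeline, the fused latent z would then be aligned with embeddings of neural activity; the abstract does not specify the alignment loss, so a contrastive or regression objective over z is only one plausible reading.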
Source journal: Information Fusion (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles per year: 161
Average review time: 7.9 months
Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.