Aligning machines and minds: Neural encoding for high-level visual cortices based on image captioning task
Xu Yin, Jiang Jiuchuan, Sheng Ge, John Qiang Gan, Haixian Wang
Journal of Neural Engineering, published 2025-10-09. DOI: 10.1088/1741-2552/ae1164
Abstract
Objective: Neural encoding of visual stimuli aims to predict brain responses in the visual cortex to different external inputs. Deep neural networks (DNNs) trained on relatively simple tasks such as image classification have been widely applied in neural encoding studies of early visual areas. However, because semantic representations in high-level visual cortices are complex and abstract, the encoding performance and interpretability of such models in these regions remain limited.
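To make the encoding setting concrete, a common baseline maps DNN features of each stimulus to voxel responses with voxel-wise ridge regression. The sketch below is illustrative only; the array shapes and random data are placeholders, not values from the paper.

```python
import numpy as np

# Placeholder shapes: 1000 stimuli, 512-dim DNN features, 200 voxels.
rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 512))   # DNN activations per image
responses = rng.standard_normal((1000, 200))  # fMRI voxel responses per image

# Closed-form ridge regression: W = (X^T X + lambda I)^-1 X^T Y
lam = 1.0
XtX = features.T @ features + lam * np.eye(features.shape[1])
weights = np.linalg.solve(XtX, features.T @ responses)

predicted = features @ weights  # predicted response of every voxel to every stimulus
```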
Approach: We propose a novel neural encoding model guided by the image captioning task (ICT). During image captioning, an attention module is employed to focus on key visual objects. In the neural encoding stage, a flexible receptive field (RF) module is designed to simulate voxel-level visual fields. To bridge the domain gap between these two processes, we introduce the Atten-RF module, which effectively aligns attention-guided visual representations with voxel-wise brain activity patterns.
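The abstract does not specify how the Atten-RF module is implemented, so the following PyTorch sketch is only one plausible reading of it: each voxel has a learnable 2D Gaussian receptive field over the captioning encoder's spatial feature map, the field is modulated by the stimulus-specific attention map, and the pooled features are read out linearly per voxel. The class and parameter names (AttenRF, centers, log_sigma) are hypothetical.

```python
import torch
import torch.nn as nn

class AttenRF(nn.Module):
    """Sketch: pool a spatial feature map with a per-voxel 2D Gaussian
    receptive field, modulated by a caption-driven attention map, then
    predict each voxel's response with a per-voxel linear readout."""

    def __init__(self, channels: int, grid: int, n_voxels: int):
        super().__init__()
        # Learnable RF centre (x, y in [-1, 1]) and log-width per voxel.
        self.centers = nn.Parameter(torch.zeros(n_voxels, 2))
        self.log_sigma = nn.Parameter(torch.zeros(n_voxels))
        # Per-voxel linear readout over pooled feature channels.
        self.readout_w = nn.Parameter(0.01 * torch.randn(n_voxels, channels))
        self.readout_b = nn.Parameter(torch.zeros(n_voxels))

        ys, xs = torch.meshgrid(torch.linspace(-1, 1, grid),
                                torch.linspace(-1, 1, grid), indexing="ij")
        self.register_buffer("coords", torch.stack([xs, ys], dim=-1))  # (G, G, 2)

    def forward(self, feat, attn):
        # feat: (B, C, G, G) captioning-encoder features; attn: (B, G, G) attention map.
        d2 = ((self.coords[None] - self.centers[:, None, None, :]) ** 2).sum(-1)  # (V, G, G)
        rf = torch.exp(-d2 / (2 * self.log_sigma.exp()[:, None, None] ** 2))
        # Combine each voxel's RF with the stimulus-specific attention and renormalise.
        weight = rf[None] * attn[:, None]                                 # (B, V, G, G)
        weight = weight / weight.sum(dim=(-2, -1), keepdim=True).clamp_min(1e-8)
        pooled = torch.einsum("bvhw,bchw->bvc", weight, feat)            # (B, V, C)
        return (pooled * self.readout_w).sum(-1) + self.readout_b        # (B, V)
```

Training the RF centres and widths jointly with the readout lets each voxel settle on its own portion of visual space; this mirrors, but need not match, the flexible RF module described in the paper.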
Main results: Experiments on the large-scale Natural Scenes Dataset (NSD) demonstrate that our method achieves superior average encoding performance across seven high-level visual cortices, with a mean squared error (MSE) of 0.765, Pearson correlation coefficient (PCC) of 0.443, and coefficient of determination (R²) of 0.245.
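For reference, the three reported metrics are conventionally computed per voxel and then averaged over voxels; a minimal sketch (the function name and array shapes are assumptions) is:

```python
import numpy as np

def encoding_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """Voxel-wise MSE, Pearson correlation, and R^2, averaged over voxels.
    y_true, y_pred: arrays of shape (n_stimuli, n_voxels)."""
    mse = ((y_true - y_pred) ** 2).mean(axis=0)

    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    pcc = (yt * yp).sum(axis=0) / np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))

    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = (yt ** 2).sum(axis=0)
    r2 = 1.0 - ss_res / ss_tot

    return mse.mean(), pcc.mean(), r2.mean()
```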
Significance: By leveraging the guidance and alignment provided by a complex vision-language task, our model enhances the prediction of voxel activity in high-level visual cortex, offering a new perspective on the neural encoding problem. Furthermore, various visualization techniques provide deeper insights into the neural mechanisms underlying visual information processing.