Aligning machines and minds: Neural encoding for high-level visual cortices based on image captioning task
Xu Yin, Jiang Jiuchuan, Sheng Ge, John Qiang Gan, Haixian Wang
Journal of Neural Engineering, published 2025-10-09. DOI: 10.1088/1741-2552/ae1164
Abstract
Objective: Neural encoding of visual stimuli aims to predict brain responses in the visual cortex to different external inputs. Deep neural networks (DNNs) trained on relatively simple tasks such as image classification have been widely applied in neural encoding studies of early visual areas. However, because semantic representations in high-level visual cortices are complex and abstract, the encoding performance and interpretability of such models in these regions remain limited.
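To make the encoding setting concrete, a common baseline maps DNN features of each stimulus to voxel responses with voxel-wise ridge regression. The sketch below is illustrative only; the array shapes and random data are placeholders, not values from the paper.

```python
import numpy as np

# Placeholder shapes: 1000 stimuli, 512-dim DNN features, 200 voxels.
rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 512))   # DNN activations per image
responses = rng.standard_normal((1000, 200))  # fMRI voxel responses per image

# Closed-form ridge regression: W = (X^T X + lambda I)^-1 X^T Y
lam = 1.0
XtX = features.T @ features + lam * np.eye(features.shape[1])
weights = np.linalg.solve(XtX, features.T @ responses)

predicted = features @ weights  # predicted response of every voxel to every stimulus
```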
Approach: We propose a novel neural encoding model guided by the image captioning task (ICT). During image captioning, an attention module is employed to focus on key visual objects. In the neural encoding stage, a flexible receptive field (RF) module is designed to simulate voxel-level visual fields. To bridge the domain gap between these two processes, we introduce the Atten-RF module, which effectively aligns attention-guided visual representations with voxel-wise brain activity patterns.
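The abstract does not specify how the Atten-RF module is implemented, so the following PyTorch sketch is only one plausible reading of it: each voxel has a learnable 2D Gaussian receptive field over the captioning encoder's spatial feature map, the field is modulated by the stimulus-specific attention map, and the pooled features are read out linearly per voxel. The class and parameter names (AttenRF, centers, log_sigma) are hypothetical.

```python
import torch
import torch.nn as nn

class AttenRF(nn.Module):
    """Sketch: pool a spatial feature map with a per-voxel 2D Gaussian
    receptive field, modulated by a caption-driven attention map, then
    predict each voxel's response with a per-voxel linear readout."""

    def __init__(self, channels: int, grid: int, n_voxels: int):
        super().__init__()
        # Learnable RF centre (x, y in [-1, 1]) and log-width per voxel.
        self.centers = nn.Parameter(torch.zeros(n_voxels, 2))
        self.log_sigma = nn.Parameter(torch.zeros(n_voxels))
        # Per-voxel linear readout over pooled feature channels.
        self.readout_w = nn.Parameter(0.01 * torch.randn(n_voxels, channels))
        self.readout_b = nn.Parameter(torch.zeros(n_voxels))

        ys, xs = torch.meshgrid(torch.linspace(-1, 1, grid),
                                torch.linspace(-1, 1, grid), indexing="ij")
        self.register_buffer("coords", torch.stack([xs, ys], dim=-1))  # (G, G, 2)

    def forward(self, feat, attn):
        # feat: (B, C, G, G) captioning-encoder features; attn: (B, G, G) attention map.
        d2 = ((self.coords[None] - self.centers[:, None, None, :]) ** 2).sum(-1)  # (V, G, G)
        rf = torch.exp(-d2 / (2 * self.log_sigma.exp()[:, None, None] ** 2))
        # Combine each voxel's RF with the stimulus-specific attention and renormalise.
        weight = rf[None] * attn[:, None]                                 # (B, V, G, G)
        weight = weight / weight.sum(dim=(-2, -1), keepdim=True).clamp_min(1e-8)
        pooled = torch.einsum("bvhw,bchw->bvc", weight, feat)            # (B, V, C)
        return (pooled * self.readout_w).sum(-1) + self.readout_b        # (B, V)
```

Training the RF centres and widths jointly with the readout lets each voxel settle on its own portion of visual space; this mirrors, but need not match, the flexible RF module described in the paper.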
Main results: Experiments on the large-scale Natural Scenes Dataset (NSD) demonstrate that our method achieves superior average encoding performance across seven high-level visual cortices, with a mean squared error (MSE) of 0.765, Pearson correlation coefficient (PCC) of 0.443, and coefficient of determination (R²) of 0.245.
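For reference, the three reported metrics are conventionally computed per voxel and then averaged over voxels; a minimal sketch (the function name and array shapes are assumptions) is:

```python
import numpy as np

def encoding_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """Voxel-wise MSE, Pearson correlation, and R^2, averaged over voxels.
    y_true, y_pred: arrays of shape (n_stimuli, n_voxels)."""
    mse = ((y_true - y_pred) ** 2).mean(axis=0)

    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    pcc = (yt * yp).sum(axis=0) / np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))

    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = (yt ** 2).sum(axis=0)
    r2 = 1.0 - ss_res / ss_tot

    return mse.mean(), pcc.mean(), r2.mean()
```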
Significance: By leveraging the guidance and alignment provided by a complex vision-language task, our model enhances the prediction of voxel activity in high-level visual cortex, offering a new perspective on the neural encoding problem. Furthermore, various visualization techniques provide deeper insights into the neural mechanisms underlying visual information processing.