Brain encoding models based on multimodal transformers can transfer across language and vision.

Jerry Tang, Meng Du, Vy A Vo, Vasudev Lal, Alexander G Huth
{"title":"基于多模态转换器的大脑编码模型可在语言和视觉之间转移。","authors":"Jerry Tang, Meng Du, Vy A Vo, Vasudev Lal, Alexander G Huth","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.</p>","PeriodicalId":72099,"journal":{"name":"Advances in neural information processing systems","volume":"36 ","pages":"29654-29666"},"PeriodicalIF":0.0000,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11250991/pdf/","citationCount":"0","resultStr":"{\"title\":\"Brain encoding models based on multimodal transformers can transfer across language and vision.\",\"authors\":\"Jerry Tang, Meng Du, Vy A Vo, Vasudev Lal, Alexander G Huth\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. 
Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.</p>\",\"PeriodicalId\":72099,\"journal\":{\"name\":\"Advances in neural information processing systems\",\"volume\":\"36 \",\"pages\":\"29654-29666\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11250991/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Advances in neural information processing systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in neural information processing systems","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.
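
The methodology summarized above lends itself to a short sketch: extract stimulus features from a multimodal transformer, fit a linear encoding model on fMRI responses to one modality (stories), and test whether it predicts responses to the other modality (movies). The sketch below is a minimal illustration with synthetic placeholder arrays standing in for transformer features and brain responses; the choice of ridge regression, the feature dimensionality, and the correlation metric are assumptions for illustration, not details confirmed by the abstract.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Placeholder dimensions: fMRI time points (TRs) per stimulus set,
# transformer feature size, and number of cortical voxels.
n_trs_story, n_trs_movie, n_features, n_voxels = 3000, 2000, 768, 5000

# Placeholder feature matrices standing in for multimodal-transformer
# representations of the story (language) and movie (vision) stimuli.
X_story = rng.standard_normal((n_trs_story, n_features))
X_movie = rng.standard_normal((n_trs_movie, n_features))

# Placeholder fMRI response matrices (time points x voxels).
Y_story = rng.standard_normal((n_trs_story, n_voxels))
Y_movie = rng.standard_normal((n_trs_movie, n_voxels))

# Train the encoding model on brain responses to one modality (stories)...
model = Ridge(alpha=1.0)
model.fit(X_story, Y_story)

# ...and test cross-modal transfer by predicting responses to the other
# modality (movies) from its transformer features.
Y_pred = model.predict(X_movie)

# Voxelwise Pearson correlation between predicted and observed responses,
# a standard encoding-model performance measure.
Y_pred_z = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
Y_true_z = (Y_movie - Y_movie.mean(0)) / Y_movie.std(0)
voxel_corr = (Y_pred_z * Y_true_z).mean(0)
print(f"mean voxelwise correlation: {voxel_corr.mean():.3f}")

With real data, the placeholder arrays would be replaced by features extracted from the same multimodal transformer for both stimulus types and by the recorded fMRI responses; transfer would then be assessed per voxel, which is how region-level effects such as the reported advantage in concept-representing cortex could be examined.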
