A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations

IF 21.4 Zone 1 Psychology Q1 MULTIDISCIPLINARY SCIENCES

Nature Human Behaviour Pub Date : 2025-03-07 DOI:10.1038/s41562-025-02105-9

Ariel Goldstein, Haocheng Wang, Leonard Niekerken, Mariano Schain, Zaid Zada, Bobbi Aubrey, Tom Sheffer, Samuel A. Nastase, Harshvardhan Gazula, Aditi Singh, Aditi Rao, Gina Choe, Catherine Kim, Werner Doyle, Daniel Friedman, Sasha Devore, Patricia Dugan, Avinatan Hassidim, Michael Brenner, Yossi Matias, Orrin Devinsky, Adeen Flinker, Uri Hasson


Citations: 0

Abstract

This study introduces a unified computational framework connecting acoustic, speech and word-level linguistic structures to study the neural basis of everyday conversations in the human brain. We used electrocorticography to record neural signals across 100 h of speech production and comprehension as participants engaged in open-ended real-life conversations. We extracted low-level acoustic, mid-level speech and contextual word embeddings from a multimodal speech-to-text model (Whisper). We developed encoding models that linearly map these embeddings onto brain activity during speech production and comprehension. Remarkably, this model accurately predicts neural activity at each level of the language processing hierarchy across hours of new conversations not used in training the model. The internal processing hierarchy in the model is aligned with the cortical hierarchy for speech and language processing, where sensory and motor regions better align with the model’s speech embeddings, and higher-level language areas better align with the model’s language embeddings. The Whisper model captures the temporal sequence of language-to-speech encoding before word articulation (speech production) and speech-to-language encoding post articulation (speech comprehension). The embeddings learned by this model outperform symbolic models in capturing neural activity supporting natural speech and language. These findings support a paradigm shift towards unified computational models that capture the entire processing hierarchy for speech comprehension and production in real-world conversations.
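The linear encoding step described in the abstract can be illustrated with a minimal sketch. The abstract states only that the encoding models linearly map embeddings onto brain activity; the ridge penalty, the closed-form solver, the per-electrode Pearson correlation on held-out data, and every name below (`fit_ridge`, `encoding_score`, `alpha`) are assumptions made for illustration, and the data are synthetic rather than taken from the study.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def columnwise_pearson(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    num = (A * B).sum(axis=0)
    den = np.sqrt((A ** 2).sum(axis=0) * (B ** 2).sum(axis=0))
    return num / den


def encoding_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit a linear map on training data and score it on held-out data."""
    # Standardize features using training statistics only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-8
    X_tr = (X_train - mu) / sd
    X_te = (X_test - mu) / sd
    # Center the targets so no explicit intercept column is needed.
    y_mu = Y_train.mean(axis=0)
    W = fit_ridge(X_tr, Y_train - y_mu, alpha)
    predicted = X_te @ W + y_mu
    return columnwise_pearson(predicted, Y_test)


# Synthetic stand-ins (hypothetical): one embedding vector per word and
# one neural response value per word for each electrode.
rng = np.random.default_rng(0)
n_words, n_dims, n_electrodes = 600, 20, 5
embeddings = rng.standard_normal((n_words, n_dims))
true_weights = rng.standard_normal((n_dims, n_electrodes))
noise = 0.5 * rng.standard_normal((n_words, n_electrodes))
neural = embeddings @ true_weights + noise

# Train on the first 500 words, evaluate on the remaining 100.
scores = encoding_score(
    embeddings[:500], neural[:500], embeddings[500:], neural[500:], alpha=1.0
)
print(scores.shape)  # one correlation value per electrode
```

Scoring on words withheld from fitting mirrors the abstract's emphasis on predicting neural activity in conversations not used for training. Comparing such scores across embedding types (acoustic, speech, language) per electrode would be one way to probe the hierarchy the abstract describes, though the study's actual procedure may differ.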




Source journal

Nature Human Behaviour Psychology-Social Psychology

CiteScore

36.80

Self-citation rate

1.00%

Articles published

227

Journal description: Nature Human Behaviour publishes research of outstanding significance into any aspect of human behavior. The research can cover the psychological, biological, and social bases of human behavior, as well as its origins, development, and disorders. The primary aim of the journal is to increase the visibility of research in the field and enhance its societal reach and impact.