A Deep Neural Network Trained on Congruent Audiovisual Speech Reports the McGurk Effect
Haotian Ma, Zhengjia Wang, Xiang Zhang, John F Magnotti, Michael S Beauchamp
bioRxiv : the preprint server for biology, 2025-09-30. doi:10.1101/2025.08.20.671347. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12393562/pdf/
In the McGurk effect, incongruent auditory and visual syllables are perceived as a third, illusory syllable. The prevailing explanation for the effect is that the illusory syllable is a consensus percept intermediate between otherwise incompatible auditory and visual representations. To test this idea, we turned to a deep neural network known as AVHuBERT that transcribes audiovisual speech with high accuracy. Critically, AVHuBERT was trained only with congruent audiovisual speech, without exposure to McGurk stimuli or other incongruent speech. In the current study, when tested with congruent audiovisual "ba", "ga", and "da" syllables recorded from eight different talkers, AVHuBERT transcribed them with near-perfect accuracy and showed a human-like pattern: highest accuracy for audiovisual speech, slightly lower accuracy for auditory-only speech, and low accuracy for visual-only speech. When presented with incongruent McGurk syllables (auditory "ba" paired with visual "ga"), AVHuBERT reported the McGurk fusion percept of "da" at a rate of 25%, many-fold greater than the rate for either the auditory or visual component of the McGurk stimulus presented on its own. To examine the individual variability that is a hallmark of human perception of the McGurk effect, 100 variants of AVHuBERT were constructed. Like human observers, the AVHuBERT variants were consistently accurate for congruent syllables but highly variable for McGurk syllables. The similarities between the responses of AVHuBERT and humans to congruent and incongruent audiovisual speech, including the McGurk effect, suggest that DNNs may be a useful tool for interrogating the perceptual and neural mechanisms of human audiovisual speech perception.
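The tabulation step described in the abstract — transcribing the same stimulus with many model variants and counting how often each percept is reported — can be sketched briefly. The sketch below is a minimal illustration, not the authors' pipeline: the transcribe() function is a hypothetical stand-in for actual AV-HuBERT inference (which the abstract does not detail), with a seeded random choice whose weights merely mimic the variable McGurk responses reported above.

import random
from collections import Counter

# Hypothetical stand-in for AVHuBERT inference on one stimulus.
# In the study, each "variant" is a separately constructed copy of
# the model; here a seeded random draw only illustrates the shape
# of the analysis, not the real network's behavior.
def transcribe(variant_seed: int, auditory: str, visual: str) -> str:
    if auditory == visual:            # congruent stimulus:
        return auditory               # near-perfect accuracy
    rng = random.Random(variant_seed)
    # Incongruent (McGurk) stimulus: variants differ in their percept.
    # Weights are placeholders loosely matching the reported 25% fusion rate.
    return rng.choices(["ba", "ga", "da"], weights=[60, 15, 25])[0]

def fusion_rate(n_variants: int, auditory: str, visual: str, fusion: str) -> float:
    """Fraction of model variants reporting the fusion percept."""
    responses = Counter(
        transcribe(seed, auditory, visual) for seed in range(n_variants)
    )
    return responses[fusion] / n_variants

if __name__ == "__main__":
    # McGurk stimulus: auditory "ba" dubbed onto visual "ga".
    print(f"McGurk fusion rate: {fusion_rate(100, 'ba', 'ga', 'da'):.0%}")
    # Congruent control: auditory and visual "da".
    print(f"Congruent accuracy: {fusion_rate(100, 'da', 'da', 'da'):.0%}")

Run as-is, the script prints a fusion rate near 25% for the McGurk stimulus and 100% accuracy for the congruent control, mirroring the qualitative pattern the abstract reports across the 100 AVHuBERT variants.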