M. Hasegawa-Johnson. In: 2007 IEEE 9th Workshop on Multimedia Signal Processing, October 2007. DOI: 10.1109/MMSP.2007.4412884.
A Multi-Stream Approach to Audiovisual Automatic Speech Recognition
This paper proposes a multi-stream approach to automatic audiovisual speech recognition, based in part on Hickok and Poeppel's dual-stream model of human speech processing. The dual-stream model proposes that semantic networks may be accessed by at least three parallel neural streams: at least two ventral streams that map directly from acoustics to words (with different time scales), and at least one dorsal stream that maps from acoustics to articulation. Our implementation represents each of these streams by a dynamic Bayesian network; disagreements between the three streams are resolved using a voting scheme. The proposed algorithm was tested using the CUAVE audiovisual speech corpus. Results indicate that the ventral stream model tends to make fewer mistakes in the labeling of vowels, while the dorsal stream model tends to make fewer mistakes in the labeling of consonants; the recognizer voting scheme takes advantage of these differences to reduce overall word error rate.
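The abstract describes combining the three per-stream recognizers by voting. The paper's exact voting rule is not given in the abstract, so the following is only a minimal sketch of one plausible scheme: each stream emits a word hypothesis, a strict majority wins, and ties fall back to a designated default stream (the tie-break and the `vote` function are assumptions, not the authors' method).

```python
from collections import Counter

def vote(hypotheses):
    """Resolve disagreements between per-stream word hypotheses by majority vote.

    hypotheses: one word string per stream (e.g. two ventral streams and
    one dorsal stream). A strict majority wins; on a tie we defer to the
    first stream -- a hypothetical tie-break, since the paper's actual
    rule is not stated in the abstract.
    """
    counts = Counter(hypotheses)
    top, top_count = counts.most_common(1)[0]
    if top_count > len(hypotheses) / 2:
        return top
    return hypotheses[0]

# Example: two streams label a segment "ba", one labels it "da".
print(vote(["ba", "da", "ba"]))  # -> ba
```

With three streams this reduces to two-out-of-three agreement; the abstract's result (ventral streams better on vowels, the dorsal stream better on consonants) suggests why such complementary voters can lower the overall word error rate.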