Frame-dependent multi-stream reliability indicators for audio-visual speech recognition

2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). Pub Date: 2003-07-06. DOI: 10.1109/ICASSP.2003.1198707

A. Garg, G. Potamianos, C. Neti, Thomas S. Huang


Citations: 28

Abstract

We investigate the use of local, frame-dependent reliability indicators of the audio and visual modalities as a means of estimating the stream exponents of multi-stream hidden Markov models for audio-visual automatic speech recognition. We consider two such indicators for each modality, defined as functions of the speech-class conditional observation probabilities of appropriate audio-only or visual-only classifiers. We then map the four reliability indicators into the stream exponents of a state-synchronous, two-stream hidden Markov model, as a sigmoid function of their linear combination. We propose two algorithms to estimate the sigmoid weights, based on the maximum conditional likelihood and minimum classification error criteria. We demonstrate the superiority of the proposed approach on a connected-digit audio-visual speech recognition task, under varying audio channel noise conditions. Indeed, the use of the estimated, frame-dependent stream exponents results in a significantly smaller word error rate than using global stream exponents. In addition, it outperforms utterance-level exponents, even though the latter utilize a priori knowledge of the utterance noise level.
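
The mapping described in the abstract, from four reliability indicators to a per-frame stream exponent through a sigmoid of their linear combination, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the bias term, the constraint that the visual exponent equals one minus the audio exponent, and the log-linear combination of the two stream likelihoods are all assumptions made here for concreteness. The indicators themselves are taken as given inputs, since the abstract does not state their exact definitions.

```python
import math


def stream_exponent(indicators, weights, bias=0.0):
    """Map reliability indicators to an audio stream exponent in (0, 1).

    Hypothetical helper: computes sigmoid(bias + sum_i w_i * d_i).
    The bias term is an assumption, not taken from the abstract.
    """
    if len(indicators) != len(weights):
        raise ValueError("indicators and weights must have the same length")
    z = bias + sum(w * d for w, d in zip(weights, indicators))
    # Numerically stable sigmoid: avoid overflow in exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def combined_log_likelihood(log_p_audio, log_p_visual, lam_audio):
    """Weight per-frame audio and visual log-likelihoods by stream exponents.

    Assumes the two exponents sum to one, so the visual exponent
    is 1 - lam_audio (an assumption for this sketch).
    """
    return lam_audio * log_p_audio + (1.0 - lam_audio) * log_p_visual


# Example frame: four indicator values (two audio, two visual), all made up.
frame_indicators = [0.8, 0.3, -0.2, 0.1]
sigmoid_weights = [1.0, 0.5, -1.0, -0.5]  # illustrative, not estimated values

lam = stream_exponent(frame_indicators, sigmoid_weights)
score = combined_log_likelihood(-12.0, -15.0, lam)
```

In the paper the sigmoid weights are estimated with the maximum conditional likelihood or minimum classification error criterion; the fixed weights above only stand in for such estimates.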

