Multi-stream Asynchrony Modeling for Audio-Visual Speech Recognition

Guoyun Lv, D. Jiang, R. Zhao, Yunshu Hou
{"title":"Multi-stream Asynchrony Modeling for Audio-Visual Speech Recognition","authors":"Guoyun Lv, D. Jiang, R. Zhao, Yunshu Hou","doi":"10.1109/ISM.2007.21","DOIUrl":null,"url":null,"abstract":"In this paper, two multi-stream asynchrony Dynamic Bayesian Network models (MS-ADBN model and MM-ADBN model) are proposed for audio-visual speech recognition (AVSR). The proposed models, with different topology structures, loose the asynchrony of audio and visual streams to word level. For MS-ADBN model, both in audio stream and in visual stream, each word is composed of its corresponding phones, and each phone is associated with observation vector. MM- ADBN model is an augmentation of MS-ADBN model, a level of hidden nodes--state level, is added between the phone level and the observation node level, to describe the dynamic process of phones. Essentially, MS-ADBN model is a word model, while MM-ADBN model is a phone model. Speech recognition experiments are done on a digit audio-visual (A-V) database, as well as on a continuous A-V database. The results demonstrate that the asynchrony description between audio and visual stream is important for AVSR system, and MM-ADBN model has the best performance for the task of continuous A-V speech recognition.","PeriodicalId":129680,"journal":{"name":"Ninth IEEE International Symposium on Multimedia (ISM 2007)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ninth IEEE International Symposium on Multimedia (ISM 2007)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISM.2007.21","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

In this paper, two multi-stream asynchrony Dynamic Bayesian Network models (the MS-ADBN model and the MM-ADBN model) are proposed for audio-visual speech recognition (AVSR). The proposed models, with different topology structures, relax the synchrony between the audio and visual streams to the word level, i.e., the two streams are required to align only at word boundaries. In the MS-ADBN model, in both the audio stream and the visual stream, each word is composed of its corresponding phones, and each phone is associated with an observation vector. The MM-ADBN model is an augmentation of the MS-ADBN model: a level of hidden nodes, the state level, is added between the phone level and the observation node level to describe the dynamic process within phones. Essentially, the MS-ADBN model is a word model, while the MM-ADBN model is a phone model. Speech recognition experiments are carried out on a digit audio-visual (A-V) database as well as on a continuous A-V database. The results demonstrate that describing the asynchrony between the audio and visual streams is important for an AVSR system, and that the MM-ADBN model performs best on the task of continuous A-V speech recognition.
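To make the hierarchy described above concrete, here is a minimal, illustrative Python sketch of the word-to-phone-to-observation structure and of the word-level synchronization constraint between the two streams. This is not the authors' implementation: the class names (Stream, WordNode, PhoneNode) and the function synchronize_at_word are assumptions introduced only for illustration, and the optional state list stands in for the extra hidden-state level that distinguishes MM-ADBN from MS-ADBN.

```python
# Toy sketch of the MS-ADBN / MM-ADBN hierarchies (illustrative only).
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PhoneNode:
    label: str
    # MM-ADBN only: hidden states between the phone and its observations,
    # modelling the dynamic process inside a phone. None for MS-ADBN.
    states: Optional[List[str]] = None
    # Observation vectors (acoustic or visual features) emitted by this phone.
    observations: List[List[float]] = field(default_factory=list)


@dataclass
class WordNode:
    label: str
    phones: List[PhoneNode]


@dataclass
class Stream:
    name: str            # "audio" or "visual"
    words: List[WordNode]


def synchronize_at_word(audio: Stream, visual: Stream) -> List[Tuple[WordNode, WordNode]]:
    """Pair audio and visual word nodes: inside a word the two streams may
    evolve asynchronously (different phone/state timings), but they must
    agree again at every word boundary, which is the word-level asynchrony
    constraint described in the abstract."""
    assert len(audio.words) == len(visual.words), "streams must share one word sequence"
    return list(zip(audio.words, visual.words))


# Toy usage: the digit "one" decomposed into phones in both streams.
audio = Stream("audio", [WordNode("one", [PhoneNode("w"), PhoneNode("ah"), PhoneNode("n")])])
visual = Stream("visual", [WordNode("one", [PhoneNode("w"), PhoneNode("ah"), PhoneNode("n")])])
for a_word, v_word in synchronize_at_word(audio, visual):
    print(a_word.label, len(a_word.phones), len(v_word.phones))
```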