Joint audio-video processing for biometric speaker identification

2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). Pub Date : 2003-07-06 DOI:10.1109/ICASSP.2003.1202376

A. Kanak, E. Erzin, Y. Yemez, A. Tekalp

Citations: 36

Abstract

We present a bimodal audio-visual speaker identification system. The objective is to improve the recognition performance over conventional unimodal schemes. The proposed system exploits not only the temporal and spatial correlations existing in the speech and video signals of a speaker, but also the cross-correlation between these two modalities. Lip images extracted from each video frame are transformed onto an eigenspace. The obtained eigenlip coefficients are interpolated to match the rate of the speech signal and fused with Mel frequency cepstral coefficients (MFCC) of the corresponding speech signal. The resulting joint feature vectors are used to train and test a hidden Markov model (HMM) based identification system. Experimental results are included to demonstrate the system performance.
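The rate-matching and fusion step described in the abstract can be illustrated with a minimal sketch. It assumes linear interpolation and plain frame-wise concatenation, neither of which the abstract specifies, and it takes the eigenlip coefficients and MFCC vectors as already computed. The names `interpolate_to_length` and `fuse` are hypothetical helpers introduced here for illustration, not part of the authors' system.

```python
from typing import List, Sequence

Vector = List[float]


def interpolate_to_length(frames: Sequence[Sequence[float]], n_out: int) -> List[Vector]:
    """Linearly resample a sequence of feature vectors to n_out frames.

    Assumption: linear interpolation; the abstract only says the eigenlip
    coefficients are "interpolated to match the rate of the speech signal".
    """
    if not frames:
        raise ValueError("need at least one input frame")
    if n_out < 1:
        raise ValueError("n_out must be positive")
    n_in = len(frames)
    if n_in == 1 or n_out == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(frames[0]) for _ in range(n_out)]
    out: List[Vector] = []
    for k in range(n_out):
        # Position of output frame k on the input (video) frame axis.
        pos = k * (n_in - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo
        out.append([(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def fuse(audio: Sequence[Sequence[float]], visual: Sequence[Sequence[float]]) -> List[Vector]:
    """Concatenate audio and visual vectors frame by frame.

    Assumption: fusion by simple concatenation into one joint vector.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams must have the same frame count")
    return [list(a) + list(v) for a, v in zip(audio, visual)]


# Toy data (made up for illustration): 3 video frames with 2 eigenlip
# coefficients each, and 5 audio frames with 3 MFCCs each.
eigenlips = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
mfcc = [[0.1, 0.2, 0.3] for _ in range(5)]

visual_upsampled = interpolate_to_length(eigenlips, len(mfcc))
joint = fuse(mfcc, visual_upsampled)  # 5 joint vectors of dimension 3 + 2
```

In a full pipeline the joint vectors would then be passed to HMM training and testing. The eigenspace projection and MFCC extraction are left out of this sketch.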
