Multi-view learning of acoustic features for speaker recognition

2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373462

Karen Livescu, Mark Stoehr

Citations: 24

Abstract

We consider learning acoustic feature transformations using an additional view of the data, in this case video of the speaker's face. Specifically, we consider a scenario in which clean audio and video are available at training time, while at test time only noisy audio is available. We use canonical correlation analysis (CCA) to learn linear projections of the acoustic observations that have maximum correlation with the video frames. We provide an initial demonstration of the approach on a speaker recognition task using data from the VidTIMIT corpus. The projected features, in combination with baseline MFCCs, outperform the baseline recognizer in noisy conditions. The techniques we present are quite general, although here we apply them to the case of a specific speaker recognition task. This is the first work of which we are aware in which multiple views are used to learn an acoustic feature projection at training time, while using only the acoustics at test time.
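The CCA step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_cca` and `augment`, the ridge term `reg`, the feature dimensions, and the synthetic "audio" and "video" data are all assumptions made for illustration. It learns a linear projection of one view (standing in for acoustic frames) that is maximally correlated with a second view (standing in for video frames), then applies that projection to noisy first-view data alone and appends the result to the original features.

```python
import numpy as np


def fit_cca(X, Y, k, reg=1e-3):
    """Fit CCA on paired views X (n x dx) and Y (n x dy).

    Returns the mean of X, a (dx x k) projection matrix for X, and the
    top-k canonical correlations. `reg` is a small ridge term added to
    both covariance matrices for numerical stability (an assumption).
    """
    n = X.shape[0]
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    # Whiten each view with its Cholesky factor: T = Lx^{-1} Sxy Ly^{-T}.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    T = np.linalg.solve(Lx, Sxy)
    T = np.linalg.solve(Ly, T.T).T

    # Singular values of T are the canonical correlations.
    U, s, _ = np.linalg.svd(T, full_matrices=False)
    A = np.linalg.solve(Lx.T, U[:, :k])  # map back to the original X space
    return mean_x, A, s[:k]


def augment(X, mean_x, A):
    """Append CCA-projected features to the original features."""
    return np.hstack([X, (X - mean_x) @ A])


# Synthetic stand-in data: both views share a 2-dimensional latent signal.
rng = np.random.default_rng(0)
n, dx, dy, k = 2000, 13, 6, 2
z = rng.standard_normal((n, 2))
audio = z @ rng.standard_normal((2, dx)) + 0.5 * rng.standard_normal((n, dx))
video = z @ rng.standard_normal((2, dy)) + 0.5 * rng.standard_normal((n, dy))

# Training uses both views; test time uses only (noisy) audio.
mean_x, A, corrs = fit_cca(audio, video, k)
noisy_audio = audio + rng.standard_normal((n, dx))
features = augment(noisy_audio, mean_x, A)
print(features.shape, corrs)
```

The second view is needed only to fit the projection; `augment` takes first-view data alone, which mirrors the train/test asymmetry in the abstract.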


