{"title":"Speaker adaptation of neural network acoustic models using i-vectors","authors":"G. Saon, H. Soltau, D. Nahamoo, M. Picheny","doi":"10.1109/ASRU.2013.6707705","DOIUrl":null,"url":null,"abstract":"We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a Switchboard 300 hours corpus show that DNNs trained on speaker independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR) with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after hessian-free sequence training over networks trained on speaker-adapted features only.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"650","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU.2013.6707705","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 650
Abstract
We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network, in parallel with the regular acoustic features used for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker, so the appended features change only across speakers. Experimental results on a 300-hour Switchboard corpus show that DNNs trained on speaker-independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker-independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR), with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after Hessian-free sequence training over networks trained on speaker-adapted features only.
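The core mechanism described in the abstract is frame-level feature augmentation: tile the speaker's i-vector across all of that speaker's frames and concatenate it to the acoustic features before the DNN input layer. Below is a minimal NumPy sketch of this step; the specific dimensions (40-dim acoustic features, 100-dim i-vector) and the function name `append_ivector` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate a speaker's i-vector to every acoustic frame.

    frames:  (T, D) per-frame acoustic features (D = 40 is an assumed dimension)
    ivector: (K,)   speaker identity vector (K = 100 is an assumed dimension)
    returns: (T, D + K) augmented features fed to the DNN input layer
    """
    # The same i-vector is repeated for every frame of this speaker,
    # so the appended block varies across speakers but not across frames.
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

# Usage: 300 frames of 40-dim features plus a 100-dim i-vector -> (300, 140) input
frames = np.random.randn(300, 40).astype(np.float32)
ivector = np.random.randn(100).astype(np.float32)
augmented = append_ivector(frames, ivector)
assert augmented.shape == (300, 140)
```

The same recipe applies at test time, which is consistent with the single-decoding-pass advantage the abstract claims: an i-vector can be estimated from the test speaker's audio without a first-pass transcript, whereas transform-based adaptation such as FMLLR typically requires an initial decoding pass.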