An investigation of augmenting speaker representations to improve speaker normalisation for DNN-based speech recognition

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2015-04-19. DOI: 10.1109/ICASSP.2015.7178844

Hengguan Huang, K. Sim


Citations: 53

Abstract

The conventional short-term interval features used by Deep Neural Networks (DNNs) do not capture longer-term information. This poses a challenge for training a speaker-independent (SI) DNN, since the short-term features do not provide sufficient information for the DNN to estimate robust factors of speaker-level variation. The key to this problem is to obtain a sufficiently robust and informative speaker representation. This paper compares several speaker representations. First, a DNN speaker classifier is used to extract bottleneck features as the speaker representation, called the Bottleneck Speaker Vector (BSV). To further improve the robustness of this representation, a first-order Bottleneck Speaker Super Vector (BSSV) is also proposed, in which the BSV is expanded into a supervector space by incorporating the phoneme posterior probabilities. Finally, a more fine-grained speaker representation based on FMLLR-shifted features is examined. Experimental results on the WSJ0 and WSJ1 datasets show that the proposed speaker representations are useful for normalising speaker effects in robust DNN-based automatic speech recognition. The best performance is achieved by augmenting the input with both the BSSV and the FMLLR-shifted representations, yielding relative performance gains of 10.0% to 15.3% over the SI DNN baseline.
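
The abstract does not spell out how the BSV and BSSV are computed, so the following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes that the BSV is the average of frame-level bottleneck features over a speaker's frames, that "first-order" means posterior-weighted per-phoneme means of those features (by analogy with first-order sufficient statistics), and that "augmenting" means appending the speaker vector to every acoustic frame. All function names, dimensions, and the random data are hypothetical.

```python
import numpy as np

def bottleneck_speaker_vector(bottleneck):
    """Assumed BSV: average frame-level bottleneck features (T x D) into a D-dim vector."""
    return bottleneck.mean(axis=0)

def bottleneck_speaker_supervector(bottleneck, posteriors, eps=1e-8):
    """Assumed first-order BSSV: posterior-weighted mean of the bottleneck
    features for each of K phonemes, concatenated into a K*D vector."""
    occupancy = posteriors.sum(axis=0)            # (K,)  soft frame count per phoneme
    first_order = posteriors.T @ bottleneck       # (K, D) posterior-weighted sums
    means = first_order / (occupancy[:, None] + eps)
    return means.reshape(-1)

def augment_frames(features, speaker_vector):
    """Append the same speaker vector to every acoustic frame (T x F)."""
    tiled = np.tile(speaker_vector, (features.shape[0], 1))
    return np.concatenate([features, tiled], axis=1)

# Hypothetical toy data: T frames, D bottleneck dims, K phonemes, F acoustic dims.
rng = np.random.default_rng(0)
T, D, K, F = 200, 4, 3, 5
bottleneck = rng.normal(size=(T, D))
logits = rng.normal(size=(T, K))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
features = rng.normal(size=(T, F))

bsv = bottleneck_speaker_vector(bottleneck)
bssv = bottleneck_speaker_supervector(bottleneck, posteriors)
dnn_input = augment_frames(features, bssv)
print(bsv.shape, bssv.shape, dnn_input.shape)
```

Under these assumptions the supervector has K times the dimensionality of the BSV, which illustrates why conditioning on phoneme posteriors could give a richer speaker description than a single averaged vector.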
