基于神经网络的快速说话人自适应滤波库层自动语音识别

2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI:10.1109/SLT.2018.8639648

Hiroshi Seki, Kazumasa Yamamoto, T. Akiba, S. Nakagawa

{"title":"基于神经网络的快速说话人自适应滤波库层自动语音识别","authors":"Hiroshi Seki, Kazumasa Yamamoto, T. Akiba, S. Nakagawa","doi":"10.1109/SLT.2018.8639648","DOIUrl":null,"url":null,"abstract":"Deep neural networks (DNN) have achieved significant success in the field of automatic speech recognition. Previously, we proposed a filterbank-incorporated DNN which takes power spectra as input features. This method has a function of VTLN (Vocal tract length normalization) and fMLLR (feature-space maximum likelihood linear regression). The filterbank layer can be implemented by using a small number of parameters and is optimized under a framework of backpropagation. Therefore, it is advantageous in adaptation under limited available data. In this paper, speaker adaptation is applied to the filterbank-incorporated DNN. By applying speaker adaptation using 15 utterances, the adapted model gave a 7.4% relative improvement in WER over the baseline DNN at a significance level of 0.005 on CSJ task. Adaptation of filterbank layer also showed better performance than the other adaptation methods; singular value decomposition (SVD) based adaptation and learning hidden unit contributions (LHUC).","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Rapid Speaker Adaptation of Neural Network Based Filterbank Layer for Automatic Speech Recognition\",\"authors\":\"Hiroshi Seki, Kazumasa Yamamoto, T. Akiba, S. Nakagawa\",\"doi\":\"10.1109/SLT.2018.8639648\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep neural networks (DNN) have achieved significant success in the field of automatic speech recognition. Previously, we proposed a filterbank-incorporated DNN which takes power spectra as input features. This method has a function of VTLN (Vocal tract length normalization) and fMLLR (feature-space maximum likelihood linear regression). The filterbank layer can be implemented by using a small number of parameters and is optimized under a framework of backpropagation. Therefore, it is advantageous in adaptation under limited available data. In this paper, speaker adaptation is applied to the filterbank-incorporated DNN. By applying speaker adaptation using 15 utterances, the adapted model gave a 7.4% relative improvement in WER over the baseline DNN at a significance level of 0.005 on CSJ task. Adaptation of filterbank layer also showed better performance than the other adaptation methods; singular value decomposition (SVD) based adaptation and learning hidden unit contributions (LHUC).\",\"PeriodicalId\":377307,\"journal\":{\"name\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT.2018.8639648\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT.2018.8639648","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

深度神经网络(DNN)在语音自动识别领域取得了显著的成功。在此之前，我们提出了一种以功率谱作为输入特征的包含滤波器组的深度神经网络。该方法具有声道长度归一化(VTLN)和特征空间最大似然线性回归(fMLLR)的功能。滤波器组层可以使用少量参数实现，并在反向传播框架下进行优化。因此，它有利于在有限的可用数据下的适应。本文将说话人自适应方法应用于含滤波器组的深度神经网络。通过使用15个话语进行说话人自适应，该模型在CSJ任务上比基线DNN相对提高了7.4%，显著性水平为0.005。滤波组层自适应也比其他自适应方法表现出更好的性能;基于奇异值分解(SVD)的自适应和学习隐藏单元贡献(LHUC)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Rapid Speaker Adaptation of Neural Network Based Filterbank Layer for Automatic Speech Recognition

Deep neural networks (DNN) have achieved significant success in the field of automatic speech recognition. Previously, we proposed a filterbank-incorporated DNN which takes power spectra as input features. This method has a function of VTLN (Vocal tract length normalization) and fMLLR (feature-space maximum likelihood linear regression). The filterbank layer can be implemented by using a small number of parameters and is optimized under a framework of backpropagation. Therefore, it is advantageous in adaptation under limited available data. In this paper, speaker adaptation is applied to the filterbank-incorporated DNN. By applying speaker adaptation using 15 utterances, the adapted model gave a 7.4% relative improvement in WER over the baseline DNN at a significance level of 0.005 on CSJ task. Adaptation of filterbank layer also showed better performance than the other adaptation methods; singular value decomposition (SVD) based adaptation and learning hidden unit contributions (LHUC).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 IEEE Spoken Language Technology Workshop (SLT)

自引率

0.00%

发文量