Joint bottleneck feature and attention model for speech recognition

Proceedings of 2018 International Conference on Mathematics and Artificial Intelligence Pub Date : 2018-04-20 DOI:10.1145/3208788.3208798

Long Xingyan, Qu Dan


Citations: 5

Abstract



Attention-based sequence-to-sequence models have recently become a research hotspot in speech recognition, but they suffer from slow convergence and poor robustness. This paper proposes a model that joins a bottleneck feature extraction network with an attention model. It consists of a Deep Belief Network (DBN) serving as the bottleneck feature extraction network and an attention-based encoder-decoder model. The DBN stores prior information from a Hidden Markov Model, which speeds up convergence and makes the features more robust and more discriminative. The attention model uses the temporal information in the feature sequence to compute phoneme posterior probabilities. The number of stacked recurrent neural network layers in the attention model is then reduced to lower the cost of gradient computation. Experiments on the TIMIT corpus show a phoneme error rate of 17.80% on the test set, a 52% decrease in the average training iteration, and a reduction in the number of training iterations from 139 to 89. On WSJ eval92, the word error rate is 12.9% without any external language model.
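The pipeline described in the abstract can be pictured with a small NumPy sketch: a stack of sigmoid layers yields bottleneck features, and an attention step weights those features over time. This is only an illustration under stated assumptions, not the authors' implementation. The layer sizes, the random (untrained) weights, the additive scoring function, and the function names `bottleneck_features` and `attention_context` are all hypothetical. A real system would pretrain the DBN, place recurrent encoder layers between the bottleneck features and the attention step, and learn every parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def bottleneck_features(frames, weights, biases, bottleneck_index):
    """Forward frames through sigmoid layers and return the activations
    of the narrow (bottleneck) layer at position bottleneck_index."""
    h = frames
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = sigmoid(h @ W + b)
        if i == bottleneck_index:
            return h
    raise ValueError("bottleneck_index is beyond the last layer")

def attention_context(encoder_states, decoder_state, W_enc, W_dec, v):
    """Additive attention (an assumed scoring function): score every time
    step against the decoder state, normalise, and average the states."""
    scores = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec) @ v
    alpha = softmax(scores)           # one weight per time step
    context = alpha @ encoder_states  # weighted sum over time
    return alpha, context

# Hypothetical sizes: 50 frames of 40-dim acoustic features.
T, feat_dim, hidden, bn_dim, dec_dim, att_dim = 50, 40, 256, 32, 64, 48
sizes = [feat_dim, hidden, bn_dim]
weights = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

frames = rng.normal(size=(T, feat_dim))
bn = bottleneck_features(frames, weights, biases, bottleneck_index=1)

# Simplification: the bottleneck features stand in for encoder states.
W_enc = rng.normal(0.0, 0.1, (bn_dim, att_dim))
W_dec = rng.normal(0.0, 0.1, (dec_dim, att_dim))
v = rng.normal(0.0, 0.1, att_dim)
decoder_state = rng.normal(size=dec_dim)

alpha, context = attention_context(bn, decoder_state, W_enc, W_dec, v)
print(bn.shape, alpha.shape, context.shape)
```

In this sketch the attention weights `alpha` form a distribution over the 50 frames, and `context` is the summary vector a decoder would use when predicting the next phoneme.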
