Fast variable-frame-rate decoding of speech recognition based on deep neural networks
Ge Zhang, Pengyuan Zhang, Jielin Pan, Yonghong Yan
2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), July 2017
DOI: 10.1109/FSKD.2017.8393381
Citations: 0
Abstract
Deep neural networks (DNNs) have recently shown impressive performance as acoustic models for large vocabulary continuous speech recognition (LVCSR) tasks. Typically, the frame shift of the neural network output is much shorter than the average length of the modeling units, so the posterior vectors of neighbouring frames are likely to be similar. This similarity, together with the stronger discrimination of neural networks compared with conventional acoustic models, suggests that frames of the neural network output can be removed according to the distance between posterior vectors, effectively reducing the computational cost of beam search. Based on this observation, the paper introduces a novel variable-frame-rate decoding approach, built on the neural network computation, that accelerates beam search for speech recognition with minor loss of accuracy. By computing the distances between posterior vectors and removing frames whose posterior vector is similar to that of the previous frame, the approach exploits the redundancy between frames and performs beam search much more quickly. Experiments on LVCSR tasks show a 2.4-times speedup of decoding compared with a typical framewise decoding implementation.
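The core idea of the frame-removal step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Euclidean distance metric, the `threshold` value, and the function name are assumptions, since the abstract does not specify which distance measure or threshold is used.

```python
import numpy as np

def drop_similar_frames(posteriors, threshold=0.5):
    """Variable-frame-rate selection sketch: keep a frame only if its
    posterior vector differs from the last *kept* frame by more than
    `threshold`. Distance metric and threshold are illustrative
    assumptions, not the paper's exact choices.

    posteriors: (T, N) array of per-frame posterior vectors.
    Returns the indices of the frames passed on to beam search.
    """
    kept = [0]  # always keep the first frame
    for t in range(1, len(posteriors)):
        # Compare against the last kept frame, not simply t-1,
        # so a slow drift cannot discard every frame.
        if np.linalg.norm(posteriors[t] - posteriors[kept[-1]]) > threshold:
            kept.append(t)
    return kept

# Example: frames 1 and 3 duplicate their predecessors and are dropped,
# so only frames 0, 2, and 4 are forwarded to the beam search.
post = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
print(drop_similar_frames(post))  # → [0, 2, 4]
```

Because the decoder then only expands hypotheses at the kept frames, the cost of beam search scales with the number of surviving frames rather than the full frame rate, which is the source of the reported 2.4-times speedup.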