Reducing the computational complexity for whole word models

H. Soltau, H. Liao, H. Sak
{"title":"降低全词模型的计算复杂度","authors":"H. Soltau, H. Liao, H. Sak","doi":"10.1109/ASRU.2017.8268917","DOIUrl":null,"url":null,"abstract":"In a previous study, we demonstrated the feasibility to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. In that system, we model about 100,000 words directly using deep bi-directional LSTM RNNs. To alleviate the data sparsity problem for word models, we train the model on 125,000 hours of semi-supervised acoustic training data. The resulting model works very well as an end-to-end all-neural speech recognition model without the use of any language model removing the need to decode. However, the very large output layer increases the computational cost substantially. In this work we address this issue by adding TDNN (Time Delay Neural Network) layers that reduce the frame rate to 120ms for the output layer. The TDNN layers are interspersed with the LSTM layers, gradually reducing the frame rate from 10ms to 120ms. The new model reduces the computational cost by 60% while improving the word error rate by 6% relative. Compared to a traditional LVCSR system, the whole word speech recognizer uses about the same CPU cycles and can easily be parallelized across CPU cores or run on GPUs.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Reducing the computational complexity for whole word models\",\"authors\":\"H. Soltau, H. Liao, H. Sak\",\"doi\":\"10.1109/ASRU.2017.8268917\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In a previous study, we demonstrated the feasibility to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. In that system, we model about 100,000 words directly using deep bi-directional LSTM RNNs. To alleviate the data sparsity problem for word models, we train the model on 125,000 hours of semi-supervised acoustic training data. The resulting model works very well as an end-to-end all-neural speech recognition model without the use of any language model removing the need to decode. However, the very large output layer increases the computational cost substantially. In this work we address this issue by adding TDNN (Time Delay Neural Network) layers that reduce the frame rate to 120ms for the output layer. The TDNN layers are interspersed with the LSTM layers, gradually reducing the frame rate from 10ms to 120ms. The new model reduces the computational cost by 60% while improving the word error rate by 6% relative. 
Compared to a traditional LVCSR system, the whole word speech recognizer uses about the same CPU cycles and can easily be parallelized across CPU cores or run on GPUs.\",\"PeriodicalId\":290868,\"journal\":{\"name\":\"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASRU.2017.8268917\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU.2017.8268917","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 13

Abstract

In a previous study, we demonstrated the feasibility of building a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. In that system, we model about 100,000 words directly using deep bi-directional LSTM RNNs. To alleviate the data sparsity problem for word models, we train the model on 125,000 hours of semi-supervised acoustic training data. The resulting model works very well as an end-to-end all-neural speech recognition model without the use of any language model, removing the need for decoding. However, the very large output layer increases the computational cost substantially. In this work, we address this issue by adding TDNN (Time Delay Neural Network) layers that reduce the frame rate to 120ms for the output layer. The TDNN layers are interspersed with the LSTM layers, gradually reducing the frame rate from 10ms to 120ms. The new model reduces the computational cost by 60% while improving the word error rate by 6% relative. Compared to a traditional LVCSR system, the whole word speech recognizer uses about the same CPU cycles and can easily be parallelized across CPU cores or run on GPUs.
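The architecture described above, interleaving time-subsampling TDNN layers with the LSTM stack so that the 100,000-word output layer fires only once every 120ms instead of every 10ms, can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not the authors' implementation: the layer sizes and the per-layer subsampling factors (2 × 2 × 3 = 12, which takes 10ms frames to 120ms) are assumptions consistent with the abstract.

```python
# Minimal sketch (assumed hyperparameters, not the authors' code) of
# TDNN layers interspersed with bi-directional LSTM layers, reducing
# the frame period from 10 ms to 120 ms before the whole-word softmax.
import torch
import torch.nn as nn

class SubsampledWholeWordModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, vocab=100_000,
                 strides=(2, 2, 3)):  # 2 * 2 * 3 = 12x subsampling
        super().__init__()
        blocks = []
        in_dim = feat_dim
        for stride in strides:
            # Bi-directional LSTM operating at the current frame rate.
            blocks.append(nn.LSTM(in_dim, hidden, batch_first=True,
                                  bidirectional=True))
            # TDNN layer realized as a strided 1-D convolution over time;
            # each one multiplies the frame period by `stride`.
            blocks.append(nn.Conv1d(2 * hidden, 2 * hidden,
                                    kernel_size=stride, stride=stride))
            in_dim = 2 * hidden
        self.blocks = nn.ModuleList(blocks)
        # The very large whole-word output layer now runs only once
        # per 120 ms frame, which is where the compute saving comes from.
        self.output = nn.Linear(2 * hidden, vocab)

    def forward(self, x):  # x: (batch, time, feat_dim) at 10 ms frames
        for layer in self.blocks:
            if isinstance(layer, nn.LSTM):
                x, _ = layer(x)
            else:  # Conv1d expects (batch, channels, time)
                x = layer(x.transpose(1, 2)).transpose(1, 2)
        return self.output(x)  # (batch, time / 12, vocab) word logits

# Usage: 3 seconds of 10 ms frames -> 25 output frames at 120 ms each.
model = SubsampledWholeWordModel()
logits = model(torch.randn(1, 300, 80))
print(logits.shape)  # torch.Size([1, 25, 100000])
```

With a 100,000-word output layer, cutting the number of output frames by 12x dominates the saving, which is consistent with the 60% overall cost reduction reported in the abstract.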