Simplified LSTMs for Speech Recognition

G. Saon, Zoltán Tüske, Kartik Audhkhasi, Brian Kingsbury, M. Picheny, Samuel Thomas
{"title":"用于语音识别的简化LSTMS","authors":"G. Saon, Zoltán Tüske, Kartik Audhkhasi, Brian Kingsbury, M. Picheny, Samuel Thomas","doi":"10.1109/ASRU46091.2019.9003898","DOIUrl":null,"url":null,"abstract":"In this paper we explore new variants of Long Short-Term Memory (LSTM) networks for sequential modeling of acoustic features. In particular, we show that: (i) removing the output gate, (ii) replacing the hyperbolic tangent nonlinearity at the cell output with hard tanh, and (iii) collapsing the cell and hidden state vectors leads to a model that is conceptually simpler than and comparable in effectiveness to a regular LSTM for speech recognition. The proposed model has 25% fewer parameters than an LSTM with the same number of cells, trains faster because it has larger gradients leading to larger steps in weight space, and reaches a better optimum because there are fewer nonlinearities to traverse across layers. We report experimental results for both hybrid and CTC acoustic models on three publicly available English datasets: Switchboard 300 hours telephone conversations, 400 hours broadcast news transcription, and the MALACH 176 hours corpus of Holocaust survivor testimonies. In all cases the proposed models achieve similar or better accuracy than regular LSTMs while being conceptually simpler.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Simplified LSTMS for Speech Recognition\",\"authors\":\"G. Saon, Zoltán Tüske, Kartik Audhkhasi, Brian Kingsbury, M. 
Picheny, Samuel Thomas\",\"doi\":\"10.1109/ASRU46091.2019.9003898\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper we explore new variants of Long Short-Term Memory (LSTM) networks for sequential modeling of acoustic features. In particular, we show that: (i) removing the output gate, (ii) replacing the hyperbolic tangent nonlinearity at the cell output with hard tanh, and (iii) collapsing the cell and hidden state vectors leads to a model that is conceptually simpler than and comparable in effectiveness to a regular LSTM for speech recognition. The proposed model has 25% fewer parameters than an LSTM with the same number of cells, trains faster because it has larger gradients leading to larger steps in weight space, and reaches a better optimum because there are fewer nonlinearities to traverse across layers. We report experimental results for both hybrid and CTC acoustic models on three publicly available English datasets: Switchboard 300 hours telephone conversations, 400 hours broadcast news transcription, and the MALACH 176 hours corpus of Holocaust survivor testimonies. 
In all cases the proposed models achieve similar or better accuracy than regular LSTMs while being conceptually simpler.\",\"PeriodicalId\":150913,\"journal\":{\"name\":\"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)\",\"volume\":\"21 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASRU46091.2019.9003898\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU46091.2019.9003898","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

In this paper we explore new variants of Long Short-Term Memory (LSTM) networks for sequential modeling of acoustic features. In particular, we show that: (i) removing the output gate, (ii) replacing the hyperbolic tangent nonlinearity at the cell output with hard tanh, and (iii) collapsing the cell and hidden state vectors leads to a model that is conceptually simpler than and comparable in effectiveness to a regular LSTM for speech recognition. The proposed model has 25% fewer parameters than an LSTM with the same number of cells, trains faster because it has larger gradients leading to larger steps in weight space, and reaches a better optimum because there are fewer nonlinearities to traverse across layers. We report experimental results for both hybrid and CTC acoustic models on three publicly available English datasets: Switchboard 300 hours telephone conversations, 400 hours broadcast news transcription, and the MALACH 176 hours corpus of Holocaust survivor testimonies. In all cases the proposed models achieve similar or better accuracy than regular LSTMs while being conceptually simpler.
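The three modifications described above (no output gate, hard tanh at the cell output, and a single merged cell/hidden state) can be sketched as one recurrent step. The NumPy snippet below is a minimal illustration of that idea, not the authors' implementation; the variable names, gate layout, and weight shapes are assumptions. With three gated transforms instead of the regular LSTM's four, the parameter count drops by the 25% the abstract cites.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_tanh(z):
    # Piecewise-linear replacement for tanh: identity on [-1, 1], clipped outside.
    return np.clip(z, -1.0, 1.0)

def simplified_lstm_step(x, h, params):
    """One step of the simplified cell: no output gate, hard tanh at the
    output, and the cell state collapsed into the hidden state h."""
    Wi, Ui, bi, Wf, Uf, bf, Wg, Ug, bg = params
    i = sigmoid(Wi @ x + Ui @ h + bi)   # input gate
    f = sigmoid(Wf @ x + Uf @ h + bf)   # forget gate
    g = np.tanh(Wg @ x + Ug @ h + bg)   # candidate update
    # Merged cell/hidden state; hard tanh keeps it bounded in [-1, 1].
    return hard_tanh(f * h + i * g)
```

Because the update uses three weight blocks (input, forget, candidate) rather than four, a layer with the same number of cells carries three quarters of a regular LSTM's recurrent parameters, matching the abstract's 25% reduction.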