M. Ghodsi, Xiaofeng Liu, J. Apfel, Rodrigo Cabrera, Eugene Weinstein
RNN-Transducer with Stateless Prediction Network
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7049-7053, May 2020
https://doi.org/10.1109/ICASSP40776.2020.9054419
The RNN-Transducer (RNNT) outperforms classic Automatic Speech Recognition (ASR) systems when a large amount of supervised training data is available. For low-resource languages, however, RNNT models overfit and cannot directly take advantage of additional large text corpora the way classic ASR systems can. We focus on the prediction network of the RNNT, since it is believed to be analogous to the Language Model (LM) in classic ASR systems. We pre-train the prediction network with text-only data and find that this does not help. Moreover, when using wordpieces, a model whose prediction network has its recurrent layers removed, which makes the prediction network stateless, performs virtually as well as the original RNNT model. The stateless prediction network does not depend on the previous output symbols, except the last one, and therefore simplifies the RNNT architecture and inference. Our results suggest that the RNNT prediction network does not function as the LM does in classical ASR. Instead, it merely helps the model align to the input audio, while the RNNT encoder and joint networks capture both the acoustic and the linguistic information.
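
To illustrate what "stateless" means here, the following is a minimal sketch, not the authors' implementation. It assumes the stateless prediction network reduces to a lookup of a vector for the last emitted label; the abstract does not specify the exact layers. The names `make_embedding_table` and `stateless_prediction`, the blank id of 0, and the table sizes are all hypothetical and chosen only for illustration.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Hypothetical embedding table: one vector of length `dim` per label."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]


def stateless_prediction(table, history, blank_id=0):
    """Return the prediction-network output for a label history.

    Only the last emitted label is used. With an empty history we assume
    a blank/start label (an assumption of this sketch). No recurrent
    state is carried between calls.
    """
    last = history[-1] if history else blank_id
    return table[last]


# Assumed toy sizes: 5 wordpiece labels, 3-dimensional outputs.
table = make_embedding_table(vocab_size=5, dim=3)

# Two different histories that end in the same label.
a = stateless_prediction(table, [1, 4, 2])
b = stateless_prediction(table, [3, 3, 2])
print(a == b)  # prints True: only the last label matters
```

Under this assumption, any two hypotheses that end in the same label share the same prediction-network output, so a decoder could compute that output once per label instead of tracking a recurrent state per hypothesis. This is one plausible reading of how the stateless design simplifies inference.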