{"title":"Monotonic Recurrent Neural Network Transducer and Decoding Strategies","authors":"Anshuman Tripathi, Han Lu, H. Sak, H. Soltau","doi":"10.1109/ASRU46091.2019.9003822","DOIUrl":null,"url":null,"abstract":"Recurrent Neural Network Transducer (RNNT) is an end-to-end model which transduces discrete input sequences to output sequences by learning alignments between the sequences. In speech recognition tasks we generally have a strictly monotonic alignment between time frames and label sequence. However, the standard RNNT loss does not enforce this constraint. This can cause some anomalies in alignments such as the model outputting a sequence of labels at a single time frame. There is also no bound on the decoding time steps. To address these problems, we introduce a monotonic version of the RNNT loss. Under the assumption that the output sequence is not longer than the input sequence, this loss can be used with forward-backward algorithm to learn strictly monotonic alignments between the sequences. We present experimental studies showing that speech recognition accuracy for monotonic RNNT is equivalent to standard RNNT. We also explore best-first and breadth-first decoding strategies for both monotonic and standard RNNT models. Our experiments show that breadth-first search is effective in exploring and combining alternative alignments. Additionally, it also allows batching of hypotheses during search label expansion, allowing better resource utilization, and resulting in decoding speedup.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU46091.2019.9003822","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 38
Abstract
Recurrent Neural Network Transducer (RNNT) is an end-to-end model which transduces discrete input sequences to output sequences by learning alignments between the sequences. In speech recognition tasks we generally have a strictly monotonic alignment between the time frames and the label sequence. However, the standard RNNT loss does not enforce this constraint, which can cause anomalies in the alignments, such as the model outputting a sequence of labels at a single time frame; there is also no bound on the number of decoding time steps. To address these problems, we introduce a monotonic version of the RNNT loss. Under the assumption that the output sequence is not longer than the input sequence, this loss can be used with the forward-backward algorithm to learn strictly monotonic alignments between the sequences. We present experimental studies showing that the speech recognition accuracy of monotonic RNNT is equivalent to that of standard RNNT. We also explore best-first and breadth-first decoding strategies for both monotonic and standard RNNT models. Our experiments show that breadth-first search is effective in exploring and combining alternative alignments. It also allows batching of hypotheses during label expansion in the search, improving resource utilization and speeding up decoding.
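As an illustration of the monotonicity constraint described in the abstract, below is a minimal sketch (not the paper's implementation) of the forward pass of such a loss in NumPy. The function name monotonic_rnnt_loss, the log_probs[t, u, k] tensor layout assumed to come from a joint network, and blank_id are assumptions made for this example. Under strict monotonicity every decoding step consumes exactly one input frame, so an output of length U requires U <= T.

```python
# Hedged sketch, not the paper's code: forward recursion of a monotonic
# RNNT-style loss. Assumes log_probs[t, u, k] = log P(symbol k | frame t,
# u labels already emitted), with k == blank_id for the blank symbol.
import numpy as np

def monotonic_rnnt_loss(log_probs: np.ndarray, labels: list, blank_id: int) -> float:
    """Negative log-likelihood of `labels` under strictly monotonic alignments."""
    T, _, _ = log_probs.shape
    U = len(labels)
    assert U <= T, "monotonic loss assumes the output is not longer than the input"

    neg_inf = -np.inf
    # alpha[t, u]: log-prob of having emitted the first u labels after t frames.
    alpha = np.full((T + 1, U + 1), neg_inf)
    alpha[0, 0] = 0.0

    for t in range(1, T + 1):
        for u in range(0, min(t, U) + 1):
            # Frame t-1 emitted blank: stay at u labels.
            stay = alpha[t - 1, u] + log_probs[t - 1, u, blank_id]
            # Frame t-1 emitted label y_u: advance from u-1 to u labels.
            emit = neg_inf
            if u > 0:
                emit = alpha[t - 1, u - 1] + log_probs[t - 1, u - 1, labels[u - 1]]
            alpha[t, u] = np.logaddexp(stay, emit)

    return -alpha[T, U]  # gradients would come from the matching backward pass
```

By contrast, the standard RNNT lattice also allows a label emission that stays on the same frame (a transition from (t, u-1) to (t, u)), which is what permits an unbounded number of labels at a single time step; the monotonic recursion above removes that transition, so every path through the lattice has exactly T steps.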