{"title":"Monotonic Recurrent Neural Network Transducer and Decoding Strategies","authors":"Anshuman Tripathi, Han Lu, H. Sak, H. Soltau","doi":"10.1109/ASRU46091.2019.9003822","DOIUrl":null,"url":null,"abstract":"Recurrent Neural Network Transducer (RNNT) is an end-to-end model which transduces discrete input sequences to output sequences by learning alignments between the sequences. In speech recognition tasks we generally have a strictly monotonic alignment between time frames and label sequence. However, the standard RNNT loss does not enforce this constraint. This can cause some anomalies in alignments such as the model outputting a sequence of labels at a single time frame. There is also no bound on the decoding time steps. To address these problems, we introduce a monotonic version of the RNNT loss. Under the assumption that the output sequence is not longer than the input sequence, this loss can be used with forward-backward algorithm to learn strictly monotonic alignments between the sequences. We present experimental studies showing that speech recognition accuracy for monotonic RNNT is equivalent to standard RNNT. We also explore best-first and breadth-first decoding strategies for both monotonic and standard RNNT models. Our experiments show that breadth-first search is effective in exploring and combining alternative alignments. Additionally, it also allows batching of hypotheses during search label expansion, allowing better resource utilization, and resulting in decoding speedup.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU46091.2019.9003822","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 38
Abstract
Recurrent Neural Network Transducer (RNNT) is an end-to-end model which transduces discrete input sequences to output sequences by learning alignments between the sequences. In speech recognition tasks we generally have a strictly monotonic alignment between the time frames and the label sequence. However, the standard RNNT loss does not enforce this constraint, which can cause anomalies in the alignments, such as the model outputting a sequence of labels at a single time frame; there is also no bound on the number of decoding time steps. To address these problems, we introduce a monotonic version of the RNNT loss. Under the assumption that the output sequence is not longer than the input sequence, this loss can be used with the forward-backward algorithm to learn strictly monotonic alignments between the sequences. We present experimental studies showing that the speech recognition accuracy of monotonic RNNT is equivalent to that of standard RNNT. We also explore best-first and breadth-first decoding strategies for both monotonic and standard RNNT models. Our experiments show that breadth-first search is effective in exploring and combining alternative alignments. It also allows batching of hypotheses during label expansion in the search, improving resource utilization and speeding up decoding.
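As an illustration of the monotonicity constraint described in the abstract, below is a minimal sketch (not the paper's implementation) of the forward pass of such a loss in NumPy. The function name monotonic_rnnt_loss, the log_probs[t, u, k] tensor layout assumed to come from a joint network, and blank_id are assumptions made for this example. Under strict monotonicity every decoding step consumes exactly one input frame, so an output of length U requires U <= T.

```python
# Hedged sketch, not the paper's code: forward recursion of a monotonic
# RNNT-style loss. Assumes log_probs[t, u, k] = log P(symbol k | frame t,
# u labels already emitted), with k == blank_id for the blank symbol.
import numpy as np

def monotonic_rnnt_loss(log_probs: np.ndarray, labels: list, blank_id: int) -> float:
    """Negative log-likelihood of `labels` under strictly monotonic alignments."""
    T, _, _ = log_probs.shape
    U = len(labels)
    assert U <= T, "monotonic loss assumes the output is not longer than the input"

    neg_inf = -np.inf
    # alpha[t, u]: log-prob of having emitted the first u labels after t frames.
    alpha = np.full((T + 1, U + 1), neg_inf)
    alpha[0, 0] = 0.0

    for t in range(1, T + 1):
        for u in range(0, min(t, U) + 1):
            # Frame t-1 emitted blank: stay at u labels.
            stay = alpha[t - 1, u] + log_probs[t - 1, u, blank_id]
            # Frame t-1 emitted label y_u: advance from u-1 to u labels.
            emit = neg_inf
            if u > 0:
                emit = alpha[t - 1, u - 1] + log_probs[t - 1, u - 1, labels[u - 1]]
            alpha[t, u] = np.logaddexp(stay, emit)

    return -alpha[T, U]  # gradients would come from the matching backward pass
```

By contrast, the standard RNNT lattice also allows a label emission that stays on the same frame (a transition from (t, u-1) to (t, u)), which is what permits an unbounded number of labels at a single time step; the monotonic recursion above removes that transition, so every path through the lattice has exactly T steps.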