{"title":"Interdecoder: using Attention Decoders as Intermediate Regularization for CTC-Based Speech Recognition","authors":"Tatsuya Komatsu, Yusuke Fujita","doi":"10.1109/SLT54892.2023.10022760","DOIUrl":null,"url":null,"abstract":"We propose InterDecoder: a new non-autoregressive automatic speech recognition (NAR-ASR) training method that injects the advantage of token-wise autoregressive decoders while keeping the efficient non-autoregressive inference. The NAR-ASR models are often less accurate than autoregressive models such as Transformer decoder, which predict tokens conditioned on previously predicted tokens. The Inter-Decoder regularizes training by feeding intermediate encoder outputs into the decoder to compute the token-level prediction errors given previous ground-truth tokens, whereas the widely used Hybrid CTC/Attention model uses the decoder loss only at the final layer. In combination with Self-conditioned CTC, which uses the Intermediate CTC predictions to condition the encoder, performance is further improved. Experiments on the Librispeech and Tedlium2 dataset show that the proposed method shows a relative 6% WER improvement at the maximum compared to the conventional NAR-ASR methods.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT54892.2023.10022760","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We propose InterDecoder, a new training method for non-autoregressive automatic speech recognition (NAR-ASR) that injects the advantages of token-wise autoregressive decoders while retaining efficient non-autoregressive inference. NAR-ASR models are often less accurate than autoregressive models such as Transformer decoders, which predict each token conditioned on previously predicted tokens. InterDecoder regularizes training by feeding intermediate encoder outputs into an attention decoder and computing token-level prediction errors given the previous ground-truth tokens, whereas the widely used hybrid CTC/attention model applies the decoder loss only at the final layer. Combining InterDecoder with Self-conditioned CTC, which conditions the encoder on intermediate CTC predictions, further improves performance. Experiments on the LibriSpeech and TEDLIUM2 datasets show that the proposed method yields up to a 6% relative WER improvement over conventional NAR-ASR methods.
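
For concreteness, the PyTorch sketch below illustrates the training scheme described in the abstract: a CTC head on the final encoder layer provides the non-autoregressive objective used at inference, while an attention decoder consumes an intermediate encoder output together with ground-truth previous tokens (teacher forcing) to produce an auxiliary token-level loss. The model sizes, which layers are tapped, the loss weight, and names such as InterDecoderASR are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of the InterDecoder training scheme under assumed
# hyperparameters (layer counts, dimensions, loss weight) -- not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterDecoderASR(nn.Module):
    """CTC encoder with an attention decoder attached to intermediate layers."""

    def __init__(self, vocab_size, d_model=256, n_enc_layers=6, inter_layers=(3,)):
        super().__init__()
        self.inter_layers = set(inter_layers)
        self.encoder_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_enc_layers)]
        )
        self.ctc_head = nn.Linear(d_model, vocab_size)  # final-layer CTC projection
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.dec_head = nn.Linear(d_model, vocab_size)

    def forward(self, feats, prev_tokens):
        """feats: (B, T, d_model) acoustic features; prev_tokens: (B, U) shifted targets."""
        inter_dec_logits = []
        x = feats
        for i, layer in enumerate(self.encoder_layers, start=1):
            x = layer(x)
            if self.training and i in self.inter_layers:
                # InterDecoder: run the attention decoder on this *intermediate*
                # encoder output, conditioned on ground-truth previous tokens.
                tgt = self.token_emb(prev_tokens)
                u = tgt.size(1)
                causal = torch.triu(torch.full((u, u), float("-inf")), diagonal=1)
                dec_out = self.decoder(tgt, x, tgt_mask=causal)
                inter_dec_logits.append(self.dec_head(dec_out))
        # Inference uses only this non-autoregressive CTC output.
        return self.ctc_head(x), inter_dec_logits


# Illustrative training step: final-layer CTC loss plus the InterDecoder
# cross-entropy regularizer (the 0.3 weight is an assumed value).
model = InterDecoderASR(vocab_size=100)
feats = torch.randn(2, 50, 256)               # (batch, frames, d_model)
targets = torch.randint(1, 100, (2, 12))      # token targets (0 reserved for blank/<sos>)
prev_tokens = F.pad(targets[:, :-1], (1, 0))  # prepend <sos>=0 for teacher forcing
ctc_logits, inter_dec_logits = model(feats, prev_tokens)

log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)  # (T, B, V) as ctc_loss expects
ctc_loss = F.ctc_loss(log_probs, targets,
                      input_lengths=torch.full((2,), 50),
                      target_lengths=torch.full((2,), 12))
dec_loss = sum(F.cross_entropy(l.transpose(1, 2), targets) for l in inter_dec_logits)
(ctc_loss + 0.3 * dec_loss).backward()
```

Because the decoder branch runs only during training, decoding remains a single non-autoregressive CTC pass; the sketch omits the Self-conditioned CTC branch mentioned in the abstract, which additionally feeds intermediate CTC predictions back into the encoder.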