Tatsuya Komatsu, Yusuke Fujita
2022 IEEE Spoken Language Technology Workshop (SLT), published 2023-01-09
DOI: 10.1109/SLT54892.2023.10022760
InterDecoder: Using Attention Decoders as Intermediate Regularization for CTC-Based Speech Recognition
We propose InterDecoder: a new non-autoregressive automatic speech recognition (NAR-ASR) training method that injects the advantages of token-wise autoregressive decoding while keeping efficient non-autoregressive inference. NAR-ASR models are often less accurate than autoregressive models such as the Transformer decoder, which predicts each token conditioned on previously predicted tokens. InterDecoder regularizes training by feeding intermediate encoder outputs into a decoder that computes token-level prediction errors given previous ground-truth tokens, whereas the widely used hybrid CTC/attention model applies the decoder loss only at the final layer. Combining InterDecoder with self-conditioned CTC, which uses intermediate CTC predictions to condition the encoder, improves performance further. Experiments on the LibriSpeech and TED-LIUM2 datasets show that the proposed method yields up to a 6% relative WER improvement over conventional NAR-ASR methods.
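The training scheme the abstract describes can be sketched as follows: a CTC loss on the final encoder output, plus a teacher-forced attention-decoder cross-entropy loss attached to intermediate encoder layers. This is a minimal PyTorch sketch, not the authors' implementation; the model sizes, the layer positions that receive the auxiliary decoder, the `<sos>` token id, and the loss weight `lam` are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class InterDecoderASR(nn.Module):
    """Sketch of InterDecoder-style training: CTC at the final layer plus an
    auxiliary teacher-forced decoder loss at intermediate encoder layers."""

    def __init__(self, vocab=100, d=64, enc_layers=4, inter_every=2):
        super().__init__()
        self.embed = nn.Linear(80, d)  # 80-dim log-mel features (assumed)
        self.enc_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, 4, 128, batch_first=True)
            for _ in range(enc_layers))
        self.ctc_head = nn.Linear(d, vocab + 1)  # +1 for the CTC blank symbol
        self.tok_embed = nn.Embedding(vocab, d)
        dec_layer = nn.TransformerDecoderLayer(d, 4, 128, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, 1)  # shared aux decoder
        self.dec_head = nn.Linear(d, vocab)
        self.inter_every = inter_every
        self.ctc = nn.CTCLoss(blank=vocab, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, feats, feat_lens, tokens, tok_lens, lam=0.3):
        x = self.embed(feats)
        # Teacher forcing: decoder input is ground truth shifted right
        # (token id 0 stands in for <sos>; an assumption of this sketch).
        dec_in = torch.cat([torch.zeros_like(tokens[:, :1]), tokens[:, :-1]], 1)
        tgt = self.tok_embed(dec_in)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        inter_losses = []
        for i, layer in enumerate(self.enc_layers, 1):
            x = layer(x)
            if i % self.inter_every == 0 and i < len(self.enc_layers):
                # InterDecoder: attend over the *intermediate* encoder output
                # and score the ground-truth tokens (training-time only).
                h = self.decoder(tgt, x, tgt_mask=mask)
                logits = self.dec_head(h)                       # (B, S, V)
                inter_losses.append(self.ce(logits.transpose(1, 2), tokens))
        # Standard CTC loss on the final encoder output; inference stays
        # non-autoregressive because only the CTC head is used at test time.
        logp = self.ctc_head(x).log_softmax(-1).transpose(0, 1)  # (T, B, V+1)
        loss_ctc = self.ctc(logp, tokens, feat_lens, tok_lens)
        loss_inter = (torch.stack(inter_losses).mean()
                      if inter_losses else logp.new_zeros(()))
        return loss_ctc + lam * loss_inter
```

Because the auxiliary decoder is used only to compute a training loss, it is discarded at inference, so decoding keeps the single-pass efficiency of plain CTC; the self-conditioned CTC variant mentioned in the abstract would additionally feed the intermediate CTC predictions back into the encoder, which this sketch omits.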