On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode

Raviraj Joshi, Subodh Kumar
{"title":"On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode","authors":"Raviraj Joshi, Subodh Kumar","doi":"10.1109/SPCOM55316.2022.9840823","DOIUrl":null,"url":null,"abstract":"The streaming automatic speech recognition (ASR) models are more popular and suitable for voice-based applications. However, non-streaming models provide better performance as they look at the entire audio context. To leverage the benefits of the non-streaming model in streaming applications like voice search, it is commonly used in second pass re-scoring mode. The candidate hypothesis generated using steaming models is re-scored using a non-streaming model.In this work, we evaluate the non-streaming attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and re-scoring modes. These models are based on Listen-Attend-Spell (LAS) encoder-decoder architecture. We experiment with different encoder variations based on LSTM, Transformer, and Conformer. We compare the latency requirements of these models along with their performance. Overall we show that the Transformer model offers acceptable WER with the lowest latency requirements. We report a relative WER improvement of around 16% with the second pass LAS rescoring with latency overhead under 5ms. We also highlight the importance of CNN front-end with Transformer architecture to achieve comparable word error rates (WER). Moreover, we observe that in the second pass re-scoring mode all the encoders provide similar benefits whereas the difference in performance is prominent in standalone text generation mode.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPCOM55316.2022.9840823","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Streaming automatic speech recognition (ASR) models are popular and well suited to voice-based applications. However, non-streaming models provide better performance because they attend to the entire audio context. To leverage the benefits of a non-streaming model in streaming applications like voice search, it is commonly used in a second-pass re-scoring mode: the candidate hypotheses generated by a streaming model are re-scored by a non-streaming model. In this work, we evaluate non-streaming attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and re-scoring modes. These models are based on the Listen-Attend-Spell (LAS) encoder-decoder architecture. We experiment with different encoder variants based on LSTM, Transformer, and Conformer, and compare their latency requirements along with their performance. Overall, we show that the Transformer model offers acceptable WER with the lowest latency requirements. We report a relative WER improvement of around 16% with second-pass LAS re-scoring, at a latency overhead of under 5 ms. We also highlight the importance of a CNN front-end with the Transformer architecture for achieving comparable word error rates (WER). Moreover, we observe that in second-pass re-scoring mode all encoders provide similar benefits, whereas the differences in performance are prominent in standalone text-generation mode.
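
The second-pass setup described in the abstract can be illustrated with a minimal Python sketch. The helpers first_pass_nbest (a streaming decoder returning N-best hypotheses with log-scores) and las_log_likelihood (the non-streaming LAS model scoring a hypothesis over the full audio context) are hypothetical placeholders, and the interpolation weight lam is illustrative rather than a value from the paper.

from typing import Callable, List, Tuple

def rescore_nbest(
    audio_features,
    first_pass_nbest: Callable[[object], List[Tuple[str, float]]],
    las_log_likelihood: Callable[[object, str], float],
    lam: float = 0.5,  # illustrative weight between first- and second-pass scores
) -> str:
    # First pass: the streaming model emits N-best hypotheses with log-scores.
    nbest = first_pass_nbest(audio_features)
    best_text, best_score = "", float("-inf")
    for text, streaming_score in nbest:
        # Second pass: re-score the same hypothesis with the non-streaming LAS
        # model, which sees the entire utterance.
        las_score = las_log_likelihood(audio_features, text)
        combined = (1.0 - lam) * streaming_score + lam * las_score
        if combined > best_score:
            best_text, best_score = text, combined
    return best_text

Because only the short N-best list is passed through the LAS model, the second pass adds little computation, which is consistent with the sub-5 ms latency overhead reported in the abstract.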