A Comparison of End-to-End Models for Long-Form Speech Recognition

C. Chiu, Wei Han, Yu Zhang, Ruoming Pang, S. Kishchenko, Patrick Nguyen, A. Narayanan, H. Liao, Shuyuan Zhang, Anjuli Kannan, Rohit Prabhavalkar, Z. Chen, Tara N. Sainath, Yonghui Wu
DOI: 10.1109/ASRU46091.2019.9003854
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Publication date: 2019-11-06
Citations: 69

Abstract

End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems [1], [2]. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical on long utterances that last from minutes to hours remains an open question. In this paper, we both investigate and improve the performance of end-to-end models on long-form transcription. We first present an empirical comparison of different end-to-end models on a real world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based systems that significantly improve its performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long utterances into shorter overlapping segments. Combining these two improvements, we show that attention-based end-to-end models can be very competitive to RNN-T on long-form speech recognition.
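The segmentation idea described in the abstract — breaking a long utterance into shorter overlapping segments before decoding — can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function name `split_into_segments`, the `seg_len` and `overlap` parameters, and the example values are all assumptions chosen for clarity.

```python
def split_into_segments(samples, seg_len, overlap):
    """Break a long frame sequence into overlapping segments.

    samples: sequence of audio frames (any sliceable sequence)
    seg_len: number of frames per segment
    overlap: number of frames shared by consecutive segments
    """
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must be in [0, seg_len)")
    step = seg_len - overlap  # how far the window advances each time
    segments = []
    start = 0
    while start < len(samples):
        segments.append(samples[start:start + seg_len])
        if start + seg_len >= len(samples):
            break  # the last segment reached the end of the sequence
        start += step
    return segments

# Example: 10 frames split into segments of 4 frames with 2 frames of overlap.
print(split_into_segments(list(range(10)), seg_len=4, overlap=2))
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each segment would then be decoded independently, with the overlapping regions used to stitch the per-segment hypotheses back into one transcript; how that merge is done (e.g. which hypothesis wins inside the overlap) is a design choice the sketch leaves open.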