A Comparison of End-to-End Models for Long-Form Speech Recognition

C. Chiu, Wei Han, Yu Zhang, Ruoming Pang, S. Kishchenko, Patrick Nguyen, A. Narayanan, H. Liao, Shuyuan Zhang, Anjuli Kannan, Rohit Prabhavalkar, Z. Chen, Tara N. Sainath, Yonghui Wu
DOI: 10.1109/ASRU46091.2019.9003854
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Publication date: 2019-11-06
Citations: 69

Abstract

End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems [1], [2]. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical on long utterances that last from minutes to hours remains an open question. In this paper, we both investigate and improve the performance of end-to-end models on long-form transcription. We first present an empirical comparison of different end-to-end models on a real world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based systems that significantly improve its performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long utterances into shorter overlapping segments. Combining these two improvements, we show that attention-based end-to-end models can be very competitive to RNN-T on long-form speech recognition.
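The segmentation idea described in the abstract — breaking a long utterance into shorter overlapping segments before decoding — can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function name `split_into_segments`, the `seg_len` and `overlap` parameters, and the example values are all assumptions chosen for clarity.

```python
def split_into_segments(samples, seg_len, overlap):
    """Break a long frame sequence into overlapping segments.

    samples: sequence of audio frames (any sliceable sequence)
    seg_len: number of frames per segment
    overlap: number of frames shared by consecutive segments
    """
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must be in [0, seg_len)")
    step = seg_len - overlap  # how far the window advances each time
    segments = []
    start = 0
    while start < len(samples):
        segments.append(samples[start:start + seg_len])
        if start + seg_len >= len(samples):
            break  # the last segment reached the end of the sequence
        start += step
    return segments

# Example: 10 frames split into segments of 4 frames with 2 frames of overlap.
print(split_into_segments(list(range(10)), seg_len=4, overlap=2))
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each segment would then be decoded independently, with the overlapping regions used to stitch the per-segment hypotheses back into one transcript; how that merge is done (e.g. which hypothesis wins inside the overlap) is a design choice the sketch leaves open.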