Cascade RNN-Transducer: Syllable Based Streaming On-Device Mandarin Speech Recognition with a Syllable-To-Character Converter

2021 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2020-11-17. DOI: 10.1109/SLT48900.2021.9383506

Xiong Wang, Zhuoyuan Yao, Xian Shi, Lei Xie


Citations: 25

Abstract

End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data to train. Further strengthening the language modeling ability through extra text data, such as shallow fusion with an external language model, brings only a small performance gain. Given that Mandarin Chinese is a character-based language and each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach first uses an RNN-T to transform acoustic features into a syllable sequence, and then converts the syllable sequence into a character sequence through an RNN-T-based syllable-to-character converter. Thus a rich text repository can easily be used to strengthen the language modeling ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.
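
The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: both stages below are simple stand-ins (no RNN-T is involved), and every name and value (`CANDIDATES`, `BIGRAM_COUNTS`, `acoustic_to_syllables`, `syllables_to_characters`, `cascade_recognize`) is a hypothetical placeholder chosen for illustration. The sketch only shows how a syllable sequence can sit between the acoustic stage and a converter whose statistics could come from text-only data.

```python
from typing import Dict, List, Tuple

# Hypothetical homophone table: tonal syllable -> candidate characters.
CANDIDATES: Dict[str, List[str]] = {
    "ni3": ["你", "拟"],
    "hao3": ["好", "郝"],
}

# Hypothetical bigram counts, standing in for statistics from text-only data.
BIGRAM_COUNTS: Dict[Tuple[str, str], int] = {
    ("<s>", "你"): 5,
    ("<s>", "拟"): 1,
    ("你", "好"): 9,
    ("你", "郝"): 1,
}


def acoustic_to_syllables(frames: List[str]) -> List[str]:
    """Stage 1 stand-in. The paper uses an RNN-T over acoustic features;
    here each "frame" is already labelled with its syllable (assumption)."""
    return list(frames)


def syllables_to_characters(syllables: List[str]) -> str:
    """Stage 2 stand-in. The paper uses an RNN-T-based converter; here we
    greedily pick the candidate with the highest bigram count (assumption)."""
    prev = "<s>"
    chars: List[str] = []
    for syl in syllables:
        best = max(CANDIDATES[syl], key=lambda ch: BIGRAM_COUNTS.get((prev, ch), 0))
        chars.append(best)
        prev = best
    return "".join(chars)


def cascade_recognize(frames: List[str]) -> str:
    """Chain the two stages: acoustic input -> syllables -> characters."""
    return syllables_to_characters(acoustic_to_syllables(frames))
```

In this toy setup, the second stage depends only on syllable and character statistics, which is why unpaired text could in principle be used to improve it without additional speech data.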

