Syllable-based acoustic modeling with CTC-SMBR-LSTM
Zhongdi Qu, Parisa Haghani, Eugene Weinstein, P. Moreno
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: 10.1109/ASRU.2017.8268932
Citations: 24
Abstract
We explore the feasibility of training long short-term memory (LSTM) recurrent neural networks (RNNs) with syllables, rather than phonemes, as outputs. Syllables are a natural choice of linguistic unit for modeling the acoustics of languages such as Mandarin Chinese, due to the inherent nature of the syllable as an elemental pronunciation construct and the limited size of the syllable set for such languages (around 1400 syllables for Mandarin). Our models are trained with Connectionist Temporal Classification (CTC) and state-level minimum Bayes risk (sMBR) loss using asynchronous stochastic gradient descent (ASGD) on a parallel computation infrastructure for large-scale training. Our acoustic models operate on feature frames computed every 30 ms, which makes them well suited for modeling syllables rather than phonemes, which can have a shorter duration. Additionally, when compared to word-level modeling, syllables have the advantage of avoiding out-of-vocabulary (OOV) model outputs. Our experiments on a Mandarin voice search task show that syllable-output models can perform better than context-independent (CI) phone-output models, and can give performance similar to our state-of-the-art context-dependent (CD) models. Additionally, decoding with syllable-output models is substantially faster than with CI or CD models. We demonstrate that these improvements are maintained when the model is trained to recognize both Mandarin syllables and English phonemes.
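To make the setup concrete, the sketch below is a minimal illustration (not the authors' implementation, which ran CTC followed by sMBR training under ASGD on a parallel infrastructure): a stacked LSTM emitting per-frame posteriors over a syllable inventory plus a CTC blank, trained with CTC loss in PyTorch. The feature dimension, network width and depth, and the syllable count of 1400 are illustrative assumptions; the sMBR stage and distributed training are omitted.

```python
# Minimal sketch: LSTM acoustic model with syllable outputs trained with CTC.
# All dimensions below are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

NUM_SYLLABLES = 1400   # approximate Mandarin syllable inventory (per the abstract)
FEATURE_DIM = 240      # hypothetical acoustic feature size per 30 ms frame
HIDDEN_DIM = 512       # hypothetical LSTM width

class SyllableCTCModel(nn.Module):
    """Stacked LSTM emitting per-frame logits over syllables + CTC blank."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(FEATURE_DIM, HIDDEN_DIM, num_layers=5, batch_first=True)
        # One logit per syllable plus the CTC blank label (index 0).
        self.proj = nn.Linear(HIDDEN_DIM, NUM_SYLLABLES + 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # (batch, time, hidden)
        return self.proj(out)          # (batch, time, labels)

model = SyllableCTCModel()
ctc_loss = nn.CTCLoss(blank=0)

# Toy batch: 2 utterances of 100 frames each (30 ms frames ~ 3 s of audio).
feats = torch.randn(2, 100, FEATURE_DIM)
log_probs = model(feats).log_softmax(-1).transpose(0, 1)   # CTCLoss wants (time, batch, labels)
targets = torch.randint(1, NUM_SYLLABLES + 1, (2, 20))     # random syllable label IDs
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in the paper, gradient steps run asynchronously across replicas (ASGD)
```

Note that the single output layer over ~1400 syllables is what distinguishes this setup from a CI phone model (a few dozen outputs) or a CD model (thousands of tied states); the smaller, fixed inventory is also what avoids OOV outputs relative to word-level modeling.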