Syllable-based acoustic modeling with CTC-SMBR-LSTM
Zhongdi Qu, Parisa Haghani, Eugene Weinstein, P. Moreno
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: 10.1109/ASRU.2017.8268932
Citations: 24
Abstract
We explore the feasibility of training long short-term memory (LSTM) recurrent neural networks (RNNs) with syllables, rather than phonemes, as outputs. Syllables are a natural choice of linguistic unit for modeling the acoustics of languages such as Mandarin Chinese, due to the inherent nature of the syllable as an elemental pronunciation construct and the limited size of the syllable set for such languages (around 1400 syllables for Mandarin). Our models are trained with Connectionist Temporal Classification (CTC) and state-level minimum Bayes risk (sMBR) loss using asynchronous stochastic gradient descent (ASGD) on a parallel computation infrastructure for large-scale training. Our acoustic models operate on feature frames computed every 30 ms, which makes them well suited for modeling syllables rather than phonemes, which can have a shorter duration. Additionally, when compared to word-level modeling, syllables have the advantage of avoiding out-of-vocabulary (OOV) model outputs. Our experiments on a Mandarin voice search task show that syllable-output models can perform better than context-independent (CI) phone-output models, and can give performance similar to our state-of-the-art context-dependent (CD) models. Additionally, decoding with syllable-output models is substantially faster than with CI or CD models. We demonstrate that these improvements are maintained when the model is trained to recognize both Mandarin syllables and English phonemes.
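To make the setup concrete, the sketch below is a minimal illustration (not the authors' implementation, which ran CTC followed by sMBR training under ASGD on a parallel infrastructure): a stacked LSTM emitting per-frame posteriors over a syllable inventory plus a CTC blank, trained with CTC loss in PyTorch. The feature dimension, network width and depth, and the syllable count of 1400 are illustrative assumptions; the sMBR stage and distributed training are omitted.

```python
# Minimal sketch: LSTM acoustic model with syllable outputs trained with CTC.
# All dimensions below are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

NUM_SYLLABLES = 1400   # approximate Mandarin syllable inventory (per the abstract)
FEATURE_DIM = 240      # hypothetical acoustic feature size per 30 ms frame
HIDDEN_DIM = 512       # hypothetical LSTM width

class SyllableCTCModel(nn.Module):
    """Stacked LSTM emitting per-frame logits over syllables + CTC blank."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(FEATURE_DIM, HIDDEN_DIM, num_layers=5, batch_first=True)
        # One logit per syllable plus the CTC blank label (index 0).
        self.proj = nn.Linear(HIDDEN_DIM, NUM_SYLLABLES + 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # (batch, time, hidden)
        return self.proj(out)          # (batch, time, labels)

model = SyllableCTCModel()
ctc_loss = nn.CTCLoss(blank=0)

# Toy batch: 2 utterances of 100 frames each (30 ms frames ~ 3 s of audio).
feats = torch.randn(2, 100, FEATURE_DIM)
log_probs = model(feats).log_softmax(-1).transpose(0, 1)   # CTCLoss wants (time, batch, labels)
targets = torch.randint(1, NUM_SYLLABLES + 1, (2, 20))     # random syllable label IDs
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in the paper, gradient steps run asynchronously across replicas (ASGD)
```

Note that the single output layer over ~1400 syllables is what distinguishes this setup from a CI phone model (a few dozen outputs) or a CD model (thousands of tied states); the smaller, fixed inventory is also what avoids OOV outputs relative to word-level modeling.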