Minimum Classification Error Training with Speech Synthesis-Based Regularization for Speech Recognition
Naoto Umezaki, Takumi Okubo, Hideyuki Watanabe, S. Katagiri, M. Ohsaki
International Conference on Signal Processing and Machine Learning, November 27, 2019. DOI: 10.1145/3372806.3372819
Abstract
Regularization is a common framework for avoiding the underestimation of the ideal Bayes error. To increase its utility for speech recognizer training, we propose a new classifier training concept that incorporates a regularization term representing the speech synthesis ability of the classifier parameters. To implement this concept, we first introduce a speech recognizer that embeds Line Spectral Pairs-Conjugate Structure-Algebraic Code Excited Linear Prediction (LSP-CS-ACELP) in a Multi-Prototype State-Transition-Model (MP-STM) classifier, define a regularization term that represents the speech synthesis ability by the distance between a training sample and its nearest MP-STM word model, and formalize a new Minimum Classification Error (MCE) training method that jointly minimizes a conventional smooth classification error count loss and the newly defined regularization term. We evaluated the proposed method on an isolated-word, closed-vocabulary, speaker-independent speech recognition task whose Bayes error is estimated to be about 20%, and found that it produced an estimate of the Bayes error (about 18.4%) with a single training run over the training dataset, without data resampling such as cross-validation and without assumptions about the sample distribution. Moreover, we investigated the quality of speech synthesized using LSP parameters derived from the trained prototypes and found that the quality of the Bayes error estimate is clearly supported by the speech synthesis ability preserved through training.
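The joint objective described in the abstract can be illustrated with a toy sketch: a smoothed (sigmoid) classification error count over a distance-based multi-prototype classifier, plus a regularizer given by the distance from a training sample to the nearest prototype of its own word class. Everything below is an illustrative assumption rather than the paper's implementation: plain Euclidean prototypes stand in for MP-STM word models with LSP-CS-ACELP parameters, and the slope ALPHA, regularization weight LAMBDA, and learning rate LR are arbitrary toy values.

```python
# Toy sketch of MCE training with a synthesis-style regularizer.
# Assumed stand-ins: Euclidean prototypes instead of MP-STM word models,
# a sigmoid-smoothed 0/1 loss, and hand-picked hyperparameters.
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES, N_PROTOS, DIM = 3, 2, 8      # toy vocabulary/model sizes (assumed)
ALPHA, LAMBDA, LR = 2.0, 0.1, 0.05      # sigmoid slope, reg weight, step size

# prototypes[j, k]: k-th prototype of word class j
prototypes = rng.normal(size=(N_CLASSES, N_PROTOS, DIM))

def nearest_per_class(x):
    """Distance to, and index of, each class's nearest prototype for sample x."""
    d = np.linalg.norm(prototypes - x, axis=2)    # shape (N_CLASSES, N_PROTOS)
    return d.min(axis=1), d.argmin(axis=1)

def train_step(x, y):
    """One descent step on J = smoothed error count + LAMBDA * regularizer."""
    d_min, k_min = nearest_per_class(x)
    g = -d_min                                    # discriminant per class
    rivals = np.where(np.arange(N_CLASSES) == y, -np.inf, g)
    r = int(np.argmax(rivals))                    # best rival class
    d_mis = -g[y] + g[r]                          # misclassification measure
    loss = 1.0 / (1.0 + np.exp(-ALPHA * d_mis))   # smooth 0/1 loss (sigmoid)
    reg = d_min[y]             # distance to nearest model of the true word
    s = ALPHA * loss * (1.0 - loss)               # d(loss)/d(d_mis)
    # True class: both terms pull its nearest prototype toward the sample.
    p_y = prototypes[y, k_min[y]]
    prototypes[y, k_min[y]] -= LR * (s + LAMBDA) * (p_y - x) / (d_min[y] + 1e-12)
    # Best rival: the MCE term pushes its nearest prototype away from the sample.
    p_r = prototypes[r, k_min[r]]
    prototypes[r, k_min[r]] += LR * s * (p_r - x) / (d_min[r] + 1e-12)
    return loss + LAMBDA * reg

# Toy usage: samples drawn around per-class centers.
centers = rng.normal(size=(N_CLASSES, DIM)) * 3.0
for epoch in range(20):
    total = 0.0
    for y in range(N_CLASSES):
        x = centers[y] + 0.5 * rng.normal(size=DIM)
        total += train_step(x, y)
    print(f"epoch {epoch:2d}  joint objective {total:.3f}")
```

Note the design point this sketch is meant to expose: the regularizer and the MCE loss agree on moving the true word's nearest prototype toward the training sample, so the synthesis-style term constrains discriminative training toward prototypes that still resemble real speech rather than drifting to arbitrary decision-boundary positions.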