Non-linear input transformations for discriminative HMMs
F. Johansen, M. H. Johnsen
Proceedings of ICASSP '94, IEEE International Conference on Acoustics, Speech and Signal Processing, 19 April 1994
DOI: 10.1109/ICASSP.1994.389314
Citations: 12
Abstract
This paper deals with speaker-independent continuous speech recognition. Our approach is based on continuous density hidden Markov models with a non-linear input feature transformation performed by a multilayer perceptron. We discuss various optimisation criteria and provide results on a TIMIT phoneme recognition task, using single frame (mutual information or relative entropy) MMI embedded in Viterbi training, and a global MMI criterion. As expected, global MMI is found superior to the frame-based criterion for continuous recognition. We further observe that optimal sentence decoding is essential to achieve maximum recognition rate for models trained by global MMI. Finally, we find that the simple MLP input transformation, with five frames of context information, can increase the recognition rate significantly compared to just using delta parameters.
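The core idea of the input transformation can be sketched as follows: each feature frame is stacked with its neighbours (five frames of context, as in the abstract) and passed through a small MLP, whose output replaces the raw features fed to the HMM. This is a minimal illustrative sketch with numpy; the layer sizes, feature dimension (13 cepstral coefficients), random weights, and tanh non-linearity are assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def stack_context(frames, context=5):
    """Stack each frame with its neighbours; edges are padded by repeating
    the first/last frame. `context` must be odd."""
    half = context // 2
    padded = np.concatenate([np.repeat(frames[:1], half, axis=0),
                             frames,
                             np.repeat(frames[-1:], half, axis=0)], axis=0)
    T = frames.shape[0]
    return np.stack([padded[t:t + context].reshape(-1) for t in range(T)])

# Hypothetical one-hidden-layer MLP; dimensions are illustrative only.
d_in, d_hidden, d_out = 13 * 5, 64, 13
W1 = rng.standard_normal((d_in, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((d_hidden, d_out)) * 0.1
b2 = np.zeros(d_out)

def mlp_transform(x):
    h = np.tanh(x @ W1 + b1)   # non-linear hidden layer
    return h @ W2 + b2         # transformed features fed to the HMM

frames = rng.standard_normal((100, 13))   # e.g. 100 frames of 13-dim cepstra
features = mlp_transform(stack_context(frames))
print(features.shape)  # (100, 13)
```

In the paper the MLP weights are trained jointly with the HMM under an MMI criterion rather than used with random weights as here; the sketch only shows the data flow of the context-stacked non-linear transform.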