Title: I-Vector estimation as auxiliary task for Multi-Task Learning based acoustic modeling for automatic speech recognition
Authors: Gueorgui Pironkov, S. Dupont, T. Dutoit
Published in: 2016 IEEE Spoken Language Technology Workshop (SLT), December 2016
DOI: 10.1109/SLT.2016.7846237
Citations: 3
Abstract
I-Vectors have been successfully applied in the speaker identification community to characterize the speaker and their acoustic environment. Recently, i-vectors have also shown their usefulness in automatic speech recognition when concatenated to standard acoustic features. Instead of directly feeding the acoustic model with i-vectors, we here investigate a Multi-Task Learning approach, where a neural network is trained to simultaneously recognize the phone-state posterior probabilities and extract i-vectors, using the standard acoustic features. Multi-Task Learning is a regularization method that aims at improving the network's generalization ability by training a single network to solve several different, but related, tasks. The core idea of using i-vector extraction as an auxiliary task is to give the network additional inter-speaker awareness and thus reduce overfitting. Overfitting is a common issue in speech recognition and is especially harmful when the amount of training data is limited. The proposed setup is trained and tested on the TIMIT database, while the acoustic modeling is performed using a Recurrent Neural Network with Long Short-Term Memory cells.
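The multi-task setup described above can be sketched as a network with a shared trunk and two heads: a softmax head producing phone-state posteriors (the main task) and a regression head estimating the utterance's i-vector (the auxiliary task), trained with a weighted sum of both losses. The sketch below is a minimal numpy illustration, not the paper's implementation: it uses a single feed-forward layer instead of an LSTM, and the feature, hidden, and i-vector dimensions are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper): 40-dim acoustic
# features, 128 shared hidden units, 183 phone states, 100-dim i-vectors.
FEAT, HID, STATES, IVEC = 40, 128, 183, 100

# Shared trunk plus two task-specific output heads.
W_shared = rng.normal(0, 0.1, (FEAT, HID))
W_phone = rng.normal(0, 0.1, (HID, STATES))  # main task: phone-state posteriors
W_ivec = rng.normal(0, 0.1, (HID, IVEC))     # auxiliary task: i-vector regression

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Shared representation feeds both heads."""
    h = np.tanh(x @ W_shared)
    return softmax(h @ W_phone), h @ W_ivec  # (posteriors, i-vector estimate)

def mtl_loss(x, phone_targets, ivec_targets, lam=0.1):
    """Weighted sum: cross-entropy (main) + lam * MSE (auxiliary)."""
    post, ivec = forward(x)
    ce = -np.log(post[np.arange(len(x)), phone_targets] + 1e-12).mean()
    mse = ((ivec - ivec_targets) ** 2).mean()
    return ce + lam * mse

# Toy batch of 8 frames with random targets, just to exercise the loss.
x = rng.normal(size=(8, FEAT))
loss = mtl_loss(x, rng.integers(0, STATES, 8), rng.normal(size=(8, IVEC)))
```

The auxiliary weight `lam` controls how strongly the i-vector task regularizes the shared representation; at test time only the phone-state head is used, so the auxiliary head adds no decoding cost.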