{"title":"Multi-task ensembles with teacher-student training","authors":"J. H. M. Wong, M. Gales","doi":"10.1109/ASRU.2017.8268920","DOIUrl":null,"url":null,"abstract":"Ensemble methods often yield significant gains for automatic speech recognition. One method to obtain a diverse ensemble is to separately train models with a range of context dependent targets, often implemented as state clusters. However, decoding the complete ensemble can be computationally expensive. To reduce this cost, the ensemble can be generated using a multi-task architecture. Here, the hidden layers are merged across all members of the ensemble, leaving only separate output layers for each set of targets. Previous investigations of this form of ensemble have used cross-entropy training, which is shown in this paper to produce only limited diversity between members of the ensemble. This paper extends the multi-task framework in several ways. First, the multi-task ensemble can be trained in a teacher-student fashion toward the ensemble of separate models, with the aim of increasing diversity. Second, the multi-task ensemble can be trained with a sequence discriminative criterion. Finally, a student model, with a single output layer, can be trained to emulate the combined ensemble, to further reduce the computational cost of decoding. These methods are evaluated on the Babel conversational telephone speech, AMI meeting transcription, and HUB4 English broadcast news tasks.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU.2017.8268920","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14
Abstract
Ensemble methods often yield significant gains for automatic speech recognition. One method to obtain a diverse ensemble is to separately train models with a range of context-dependent targets, often implemented as state clusters. However, decoding the complete ensemble can be computationally expensive. To reduce this cost, the ensemble can be generated using a multi-task architecture, in which the hidden layers are merged across all members of the ensemble, leaving only separate output layers for each set of targets. Previous investigations of this form of ensemble have used cross-entropy training, which is shown in this paper to produce only limited diversity between members of the ensemble. This paper extends the multi-task framework in several ways. First, the multi-task ensemble can be trained in a teacher-student fashion toward the ensemble of separately trained models, with the aim of increasing diversity. Second, the multi-task ensemble can be trained with a sequence discriminative criterion. Finally, a student model with a single output layer can be trained to emulate the combined ensemble, further reducing the computational cost of decoding. These methods are evaluated on the Babel conversational telephone speech, AMI meeting transcription, and HUB4 English broadcast news tasks.
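To make the shared-hidden-layer architecture and the frame-level teacher-student criterion concrete, here is a minimal PyTorch-style sketch. Everything in it is an illustrative assumption rather than the paper's actual configuration: the names `MultiTaskEnsemble` and `teacher_student_loss`, the two-layer feed-forward body, and all layer sizes are made up for exposition; the paper's models and sequence-level training are not reproduced here.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskEnsemble(nn.Module):
    """Hidden layers shared (merged) across all ensemble members, with a
    separate output layer for each set of context-dependent targets.
    The shared body means most of the decoding cost is paid only once."""
    def __init__(self, feat_dim, hidden_dim, target_sizes):
        super().__init__()
        # Merged hidden layers, common to every member of the ensemble.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One output layer per target set (e.g. per state clustering).
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, n) for n in target_sizes
        )

    def forward(self, x):
        h = self.shared(x)
        # One logit vector per ensemble member, over that member's targets.
        return [head(h) for head in self.heads]

def teacher_student_loss(student_logits, teacher_posteriors):
    """Frame-level teacher-student criterion: KL divergence from each
    separately trained teacher's posteriors to the matching output head,
    averaged over the ensemble members."""
    loss = 0.0
    for logits, probs in zip(student_logits, teacher_posteriors):
        loss = loss + F.kl_div(
            F.log_softmax(logits, dim=-1), probs, reduction="batchmean"
        )
    return loss / len(student_logits)
```

The same recipe covers the paper's final step: a model with a single output layer can be trained with an analogous KL loss toward the combined (e.g. averaged) posteriors of the ensemble, so that decoding needs only one output layer. This sketch shows only the frame-level cross-entropy-style distillation, not the sequence discriminative training the paper also investigates.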