Mingkun Huang, Yizhou Lu, Lan Wang, Y. Qian, Kai Yu
Exploring Model Units and Training Strategies for End-to-End Speech Recognition
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), published 2019-12-01
DOI: https://doi.org/10.1109/ASRU46091.2019.9003834
Citations: 8
Abstract
In this work, we explore end-to-end speech recognition models (CTC, RNN-Transducer and attention-based models) with different model units (character, wordpiece and word) and various training strategies. We show that wordpiece units outperform character units for all end-to-end systems on the Switchboard Hub5'00 benchmark. To improve the performance of end-to-end systems, we propose a multi-stage pretraining strategy, which gives 25.0% and 18.0% relative improvements over training from scratch for attention and RNN-T models, respectively, with wordpiece units. We achieve state-of-the-art performance on the Switchboard+Fisher-2000h task, outperforming all prior work. Together with other training strategies such as label smoothing and data augmentation, we achieve 5.9%/12.1% WER on the Switchboard/CallHome test sets without using any external language model. This is a new performance milestone for a single end-to-end system, and it is also much better than the best previously published hybrid system, which reaches 6.7%/12.5% on the two sets, respectively.
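To put the comparison with the hybrid system on the same relative scale as the pretraining gains, the short sketch below computes the relative WER reduction from the absolute numbers quoted in the abstract. The helper name `relative_reduction` is hypothetical, and the resulting percentages are derived here from the quoted WERs; they are not figures reported in the abstract itself.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer with respect to baseline_wer.

    Both arguments are word error rates in percent; the result is a fraction
    (0.10 means a 10% relative reduction). Hypothetical helper, for illustration.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# WERs quoted in the abstract: (hybrid baseline, end-to-end system), in percent.
results = {
    "Switchboard": (6.7, 5.9),
    "CallHome": (12.5, 12.1),
}

for test_set, (hybrid, end_to_end) in results.items():
    rel = relative_reduction(hybrid, end_to_end)
    print(f"{test_set}: {hybrid}% -> {end_to_end}% WER "
          f"({100 * rel:.1f}% relative reduction)")
```

Under this definition, the absolute gap of 0.8 points on Switchboard corresponds to a noticeably larger relative reduction than the 0.4-point gap on CallHome, because the CallHome baseline is higher.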