Improving Very Deep Time-Delay Neural Network With Vertical-Attention For Effectively Training CTC-Based ASR Systems
Sheng Li, Xugang Lu, R. Takashima, Peng Shen, Tatsuya Kawahara, H. Kawai
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018
DOI: 10.1109/SLT.2018.8639675 (https://doi.org/10.1109/SLT.2018.8639675)
Citations: 3
Abstract
Very deep neural networks have recently been proposed for speech recognition and achieve significant performance gains. They also have excellent potential for integration with end-to-end (E2E) training, and connectionist temporal classification (CTC) has shown great promise for E2E acoustic modeling. In this study, we investigate deep architectures and techniques suitable for CTC-based acoustic modeling and propose a very deep residual time-delay CTC neural network (VResTD-CTC). Selecting a deep architecture well matched to the CTC objective function is crucial for obtaining state-of-the-art performance. Excellent results can be obtained by tuning deep architectures for non-E2E ASR systems that model tied-triphone states; however, these optimized structures are not guaranteed to achieve better, or even comparable, performance on E2E systems (e.g., CTC-based) that model dynamic acoustic units. To solve this problem and further improve system performance, we introduce a vertical-attention mechanism that reweights the residual blocks at each time step. Speech recognition experiments show that our proposed model significantly outperforms DNN- and LSTM-based (both bidirectional and unidirectional) CTC baseline models.
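The abstract gives no implementation details for the vertical-attention mechanism beyond "reweight the residual blocks at each time step." The sketch below is a minimal, hypothetical illustration of that idea (not the authors' code): a learned scalar score is computed for each residual block's output at each frame, a softmax over the depth axis turns the scores into weights, and the block outputs are fused by a weighted sum. The class name `VerticalAttention` and all shapes are assumptions.

```python
import torch
import torch.nn as nn

class VerticalAttention(nn.Module):
    """Hypothetical sketch: reweight the outputs of a stack of residual
    blocks at each time step, as described in the abstract.

    `block_outputs` is a list of N tensors, each shaped
    (batch, time, dim), collected from N residual blocks.
    """
    def __init__(self, dim: int):
        super().__init__()
        # One scalar attention score per block, per frame (an assumption;
        # the paper may use a different scoring function).
        self.score = nn.Linear(dim, 1)

    def forward(self, block_outputs: list) -> torch.Tensor:
        # Stack along a new "depth" axis: (batch, time, num_blocks, dim)
        stacked = torch.stack(block_outputs, dim=2)
        # Softmax over the depth axis gives per-frame weights for the blocks:
        # (batch, time, num_blocks, 1)
        weights = torch.softmax(self.score(stacked), dim=2)
        # Weighted sum over depth fuses the blocks: (batch, time, dim)
        return (weights * stacked).sum(dim=2)

# Toy usage: 3 residual-block outputs, batch of 2 utterances, 50 frames, 128 dims
blocks = [torch.randn(2, 50, 128) for _ in range(3)]
fused = VerticalAttention(128)(blocks)
print(fused.shape)  # torch.Size([2, 50, 128])
```

Because the weights are recomputed per frame, the network can emphasize shallower or deeper residual blocks depending on the local acoustic context, which is one plausible reading of why this helps CTC models operating on dynamic acoustic units.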