{"title":"On Temporal Context Information for Hybrid BLSTM-Based Phoneme Recognition","authors":"Timo Lohrenz, Maximilian Strake, T. Fingscheidt","doi":"10.1109/ASRU46091.2019.9003946","DOIUrl":null,"url":null,"abstract":"The modern approach to include long-term temporal context information into speech recognition systems is the use of recurrent neural networks, e.g., bi-directional long short-term memory (BLSTM) networks. In this paper, we decouple the BLSTM from a preceding CNN-based feature extractor network allowing us to investigate the use of temporal context in both models in a modular fashion. Accordingly, we train the BLSTMs on posteriors, stemming from preceding CNNs which use various amounts of limited context in their input layer, and investigate to what extent the BLSTM is able to effectively make use of its long-term modeling capabilities. We show that it is beneficial to train the BLSTM on posteriors stemming from a temporal context-free acoustic model. Remarkably, the best performing combination of CNN acoustic model and BLSTM afterwards is a large-context CNN (expected), followed by a BLSTM which has been trained on context-free CNN output posteriors (surprising).","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"382 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU46091.2019.9003946","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
The modern approach to including long-term temporal context information in speech recognition systems is the use of recurrent neural networks, e.g., bi-directional long short-term memory (BLSTM) networks. In this paper, we decouple the BLSTM from a preceding CNN-based feature extractor network, allowing us to investigate the use of temporal context in both models in a modular fashion. Accordingly, we train BLSTMs on posteriors stemming from preceding CNNs that use varying amounts of limited context in their input layers, and investigate to what extent the BLSTM can effectively exploit its long-term modeling capabilities. We show that it is beneficial to train the BLSTM on posteriors stemming from a temporal context-free acoustic model. Remarkably, the best-performing combination of CNN acoustic model and subsequent BLSTM is a large-context CNN (expected), followed by a BLSTM that has been trained on context-free CNN output posteriors (surprising).
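The decoupled pipeline the abstract describes can be made concrete with a short sketch. The following PyTorch code is a minimal illustration, not the authors' implementation: the class names (ContextCNN, PosteriorBLSTM), layer widths, and the output dimension (41, standing in for the phoneme or HMM-state inventory) are assumptions chosen for illustration. It shows the two ideas at play: the CNN's temporal context is capped by a fixed input window of +/- context frames, while the BLSTM runs over the CNN's frame-wise posteriors and contributes long-term context on top.

import torch
import torch.nn as nn

class ContextCNN(nn.Module):
    # Hypothetical CNN acoustic model whose only temporal context is a
    # window of +/- `context` frames in the input layer; context=0 yields
    # the temporal context-free acoustic model discussed in the paper.
    def __init__(self, num_features=40, num_phonemes=41, context=0):
        super().__init__()
        k = 2 * context + 1  # temporal kernel spans the full input window
        self.conv = nn.Conv1d(num_features, 128, kernel_size=k, padding=context)
        self.out = nn.Conv1d(128, num_phonemes, kernel_size=1)

    def forward(self, feats):               # feats: (batch, num_features, T)
        h = torch.relu(self.conv(feats))
        return self.out(h)                  # frame-wise logits: (batch, num_phonemes, T)

class PosteriorBLSTM(nn.Module):
    # BLSTM trained on the CNN's frame-wise posteriors, adding long-term
    # (in principle unbounded) temporal modeling on top of the CNN.
    def __init__(self, num_phonemes=41, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(num_phonemes, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_phonemes)

    def forward(self, posteriors):          # posteriors: (batch, T, num_phonemes)
        h, _ = self.blstm(posteriors)
        return self.out(h)                  # refined logits: (batch, T, num_phonemes)

# Modular combination: here a context-free CNN (context=0) feeds the BLSTM,
# the training configuration the paper reports as beneficial.
cnn = ContextCNN(context=0)
blstm = PosteriorBLSTM()
feats = torch.randn(8, 40, 100)                 # 8 utterances, 40 features, 100 frames
posteriors = cnn(feats).softmax(dim=1)          # (8, 41, 100)
logits = blstm(posteriors.transpose(1, 2))      # (8, 100, 41)

Because the two models only communicate through the posterior sequence, the CNN's input context and the BLSTM can be varied independently, which is what enables the paper's modular investigation.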