Maximilian Strake, Pascal Behr, Timo Lohrenz, T. Fingscheidt
2018 IEEE Spoken Language Technology Workshop (SLT), published 2018-12-01. DOI: https://doi.org/10.1109/SLT.2018.8639529
DenseNet BLSTM for Acoustic Modeling in Robust ASR
In recent years, robust automatic speech recognition (ASR) has benefited greatly from the use of neural networks for acoustic modeling, although performance still degrades in severe noise conditions. Building on the previous success of models that combine convolutional layers with subsequent bidirectional long short-term memory (BLSTM) layers in the same network, we propose to use a densely connected convolutional network (DenseNet) as the first part of such a model and a BLSTM network as the second. A particular contribution of our work is that we modify the DenseNet topology so that it serves as a feature extractor for the subsequent BLSTM network operating on whole speech utterances. We evaluate our model on the 6-channel task of CHiME-4 and consistently outperform a top-performing baseline based on wide residual networks and BLSTMs, achieving a 2.4% relative word error rate (WER) reduction on the real test set.
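The reported gain is a relative reduction in word error rate (WER), which is easy to confuse with an absolute difference in percentage points. The following is a minimal sketch of the arithmetic. The function name `relative_wer_reduction` and the two WER values are hypothetical, chosen only so that the result comes out at 2.4%; they are not the figures reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline, in percent.

    Both arguments are WERs expressed in percent (e.g. 5.0 for 5%).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical WERs in percent, used only to illustrate the calculation.
# They are NOT taken from the paper.
baseline_wer = 5.00
system_wer = 4.88

reduction = relative_wer_reduction(baseline_wer, system_wer)
print(f"{reduction:.1f}% relative WER reduction")
```

With these illustrative values, an absolute improvement of only 0.12 percentage points corresponds to a 2.4% relative reduction, which is why the two measures should not be read interchangeably.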