Multichannel ASR with Knowledge Distillation and Generalized Cross Correlation Feature
Authors: Wenjie Li, Yu Zhang, Pengyuan Zhang, Fengpei Ge
Venue: 2018 IEEE Spoken Language Technology Workshop (SLT)
Published: 2018-12-01
DOI: 10.1109/SLT.2018.8639600 (https://doi.org/10.1109/SLT.2018.8639600)
Multi-channel signal processing techniques have played an important role in far-field automatic speech recognition (ASR) as a separate front-end enhancement stage. However, they often suffer from a mismatch problem. In this paper, we propose a novel acoustic model architecture that uses multi-channel speech directly, without preprocessing. In addition, knowledge distillation and generalized cross correlation (GCC) adaptation are employed. We use knowledge distillation to transfer knowledge from a well-trained close-talking model to distant-talking scenarios for every frame of the multi-channel distant speech. Moreover, the GCC between microphones, which contains spatial information, is supplied as an auxiliary input to the neural network. We observe that the two techniques complement each other well. Evaluated on the AMI and ICSI meeting corpora, the proposed methods achieve relative word error rate (WER) improvements of 7.7% and 7.5% over the model trained directly on the concatenated multi-channel speech.
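The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions the abstract does not specify: the PHAT weighting inside `gcc_phat`, the lag window `max_lag`, the interpolation weight `alpha`, and the mixing of soft teacher targets with hard labels in `distillation_loss`. All function and parameter names are hypothetical.

```python
import numpy as np

def gcc_phat(x, y, max_lag):
    """GCC between two equal-length frames for lags -max_lag..max_lag.

    PHAT weighting is an assumption; the abstract only says "GCC".
    max_lag must be a positive integer.
    """
    n = len(x) + len(y)                # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)             # cross-power spectrum
    cross /= np.abs(cross) + 1e-12     # PHAT: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)      # cc[k] correlates x[t + k] with y[t]
    # Reorder so that index 0 corresponds to lag -max_lag.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, hard_labels, alpha=0.5):
    """Frame-averaged loss mixing soft teacher targets with hard labels.

    Inputs have shape (frames, classes); hard_labels has shape (frames,).
    The alpha interpolation is an assumed, commonly used formulation.
    """
    log_p = log_softmax(student_logits)           # student log-posteriors
    q = np.exp(log_softmax(teacher_logits))       # teacher posteriors
    soft = -(q * log_p).sum(axis=-1).mean()       # cross-entropy vs. teacher
    hard = -log_p[np.arange(len(hard_labels)), hard_labels].mean()
    return alpha * soft + (1.0 - alpha) * hard
```

In a setup like the one described, the teacher logits would come from the close-talking model run on the parallel close-talking recording, the student logits from the multi-channel model, and the `gcc_phat` vectors for each microphone pair would be appended to the student's per-frame input features.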