IKDMM

Proceedings of the ACM Multimedia Asia. Pub Date: 2019-12-15. DOI: 10.1145/3338533.3366607

Zhaoyi Liu, Yuexian Zou


Citations: 0

Abstract

Microphone array beamforming has proven to be an effective method for suppressing interference. Recently, acoustic beamformers that use neural networks (NNs) to estimate the time-frequency (T-F) mask, termed TFMask-BF, have received considerable attention and have shown clear benefits as a front-end for noise-robust automatic speech recognition (ASR). However, our preliminary experiments using TFMask-BF for ASR show that a mask model trained on simulated data does not perform well in real environments because of a mismatch between simulated and real-recording data. In this study, we adopt the knowledge distillation learning framework to use real-recording data together with simulated data during training and thereby reduce the impact of this mismatch. Moreover, we develop a novel iterative knowledge distillation mask model (IKDMM) training scheme. Specifically, two bi-directional long short-term memory (BLSTM) models are designed as a teacher mask model (TMM) and a student mask model (SMM). At each iteration, the TMM is trained on simulated data and then used to generate soft mask labels separately for the simulated and the real-recording data. The simulated and real-recording data, together with their generated soft mask labels, form the new training set on which the SMM is trained at that iteration. The proposed approach is evaluated as a front-end for ASR on the six-channel CHiME-4 corpus. Experimental results show that IKDMM reduces the data mismatch problem, yielding a 5% relative word error rate (WER) reduction compared with conventional TFMask-BF on real-recording data under noisy conditions.
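The teacher/student loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: a per-bin logistic regression (`ToyMaskModel`, a hypothetical name) stands in for the BLSTM mask models, features are plain NumPy arrays, and both models simply keep their weights from one iteration to the next, since the abstract does not say how iterations are linked. The function name `iterative_kd` and all parameters are likewise illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ToyMaskModel:
    """Assumed stand-in for a BLSTM mask estimator: logistic regression per T-F bin."""

    def __init__(self, n_features, rng):
        self.w = rng.normal(scale=0.01, size=n_features)
        self.b = 0.0

    def predict(self, feats):
        # Returns a mask value in [0, 1] for each row (T-F bin) of feats.
        return sigmoid(feats @ self.w + self.b)

    def fit(self, feats, targets, epochs=50, lr=0.5):
        # Gradient descent on binary cross-entropy; targets may be soft labels in [0, 1].
        for _ in range(epochs):
            err = self.predict(feats) - targets
            self.w -= lr * feats.T @ err / len(targets)
            self.b -= lr * err.mean()


def iterative_kd(sim_feats, sim_masks, real_feats, n_iterations=3, seed=0):
    """Toy version of the iterative teacher/student scheme from the abstract."""
    rng = np.random.default_rng(seed)
    teacher = ToyMaskModel(sim_feats.shape[1], rng)
    student = ToyMaskModel(sim_feats.shape[1], rng)
    for _ in range(n_iterations):
        # 1. Train the teacher on simulated data, where reference masks are known.
        teacher.fit(sim_feats, sim_masks)
        # 2. Use the teacher to generate soft mask labels for both data sets.
        soft_sim = teacher.predict(sim_feats)
        soft_real = teacher.predict(real_feats)
        # 3. Train the student on simulated + real data with the soft labels.
        student.fit(
            np.vstack([sim_feats, real_feats]),
            np.concatenate([soft_sim, soft_real]),
        )
    return teacher, student
```

Calling `iterative_kd(sim_feats, sim_masks, real_feats)` with 2-D feature arrays of matching width returns both models; the point of the design is that the student sees real-recording data, which has no reference mask, only through the teacher's soft labels.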
