UNet++-Based Multi-Channel Speech Dereverberation and Distant Speech Recognition
Authors: Tuo Zhao, Yunxin Zhao, Shaojun Wang, Mei Han
Published in: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)
Publication date: 2021-01-24
DOI: https://doi.org/10.1109/ISCSLP49672.2021.9362064
Citation count: 8
Abstract
We propose a novel approach that uses a recently introduced fully convolutional network (FCN) architecture, UNet++, for multichannel speech dereverberation and distant speech recognition (DSR). While the earlier FCN architecture UNet is good at exploiting the time-frequency structure of speech, UNet++ is more robust to the choice of network depth and makes better use of skip connections. For DSR, UNet++ serves as a feature enhancement front-end, and the enhanced speech features are used for acoustic model training and recognition. We also propose a frequency-dependent convolution scheme (FDCS), which yields new variants of UNet and UNet++. We present DSR results on the multiple distant microphone (MDM) datasets of the AMI meeting corpus, and compare the performance of UNet++ with that of UNet and weighted prediction error (WPE). Our results demonstrate that for DSR, the UNet++-based approaches provide large word error rate (WER) reductions over their UNet- and WPE-based counterparts. UNet++ with WPE preprocessing and 4-channel input achieves the lowest WERs. Dereverberation quality is also measured by the speech-to-reverberation modulation energy ratio (SRMR), which likewise shows large gains of UNet++ over UNet and WPE.
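The DSR comparisons above are reported as word error rate (WER). As a reminder of how that metric is conventionally defined, here is a minimal Python sketch that computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring tool used in the paper, which the abstract does not name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: (substitutions + deletions + insertions) / reference length.

    Assumes plain whitespace tokenization, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions, so it is an error ratio rather than a bounded percentage.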