UNet++-Based Multi-Channel Speech Dereverberation and Distant Speech Recognition

2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). Pub Date: 2021-01-24. DOI: 10.1109/ISCSLP49672.2021.9362064

Tuo Zhao, Yunxin Zhao, Shaojun Wang, Mei Han

Citations: 8

Abstract

We propose a novel approach that applies a recently introduced fully convolutional network (FCN) architecture, UNet++, to multichannel speech dereverberation and distant speech recognition (DSR). While the earlier FCN architecture UNet is good at exploiting the time-frequency structure of speech, UNet++ offers greater robustness with respect to network depth and skip connections. For DSR, UNet++ serves as a feature-enhancement front end, and the enhanced speech features are used for acoustic model training and recognition. We also propose a frequency-dependent convolution scheme (FDCS), which yields new variants of UNet and UNet++. We present DSR results on the multiple distant microphone (MDM) datasets of the AMI meeting corpus and compare the performance of UNet++ with that of UNet and weighted prediction error (WPE). Our results demonstrate that, for DSR, the UNet++-based approaches provide large word error rate (WER) reductions over their UNet- and WPE-based counterparts. UNet++ with WPE preprocessing and 4-channel input achieves the lowest WERs. The dereverberation results are also measured by the speech-to-reverberation modulation energy ratio (SRMR), which likewise shows large gains of UNet++ over UNet and WPE.
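
The abstract contrasts the skip connections of UNet and UNet++ without spelling out the difference. The sketch below is an illustration only: it lists which nodes feed each node in the nested UNet++ layout as it is generally described (node X[i][j] receives every earlier node on its own level plus the up-sampled output of the node one level deeper), next to a plain UNet. The function names and the `depth` parameter are hypothetical, and nothing here reflects the specific configuration, FDCS variant, or layer sizes used in the paper.

```python
from typing import Dict, List, Tuple

# (level i, position j): i counts down-sampling steps, j counts steps along a level.
Node = Tuple[int, int]


def unetpp_inputs(depth: int) -> Dict[Node, List[Node]]:
    """Map each UNet++ node X[i][j] to the nodes whose outputs feed it.

    `depth` (hypothetical parameter) is the number of down-sampling steps,
    so the encoder backbone runs from X[0][0] to X[depth][0].
    """
    inputs: Dict[Node, List[Node]] = {}
    for j in range(depth + 1):
        for i in range(depth + 1 - j):
            if j == 0:
                # Encoder backbone: fed by the node one level above, if any.
                inputs[(i, 0)] = [(i - 1, 0)] if i > 0 else []
            else:
                # Dense skips from all earlier nodes on the same level,
                # plus the up-sampled output of the node one level deeper.
                same_level = [(i, k) for k in range(j)]
                inputs[(i, j)] = same_level + [(i + 1, j - 1)]
    return inputs


def unet_inputs(depth: int) -> Dict[Node, List[Node]]:
    """Plain UNet for comparison: one encoder skip per decoder node."""
    inputs: Dict[Node, List[Node]] = {}
    for i in range(depth + 1):
        inputs[(i, 0)] = [(i - 1, 0)] if i > 0 else []
    for i in range(depth - 1, -1, -1):
        j = depth - i
        # Decoder node: encoder feature of the same level + deeper decoder node.
        inputs[(i, j)] = [(i, 0), (i + 1, j - 1)]
    return inputs


if __name__ == "__main__":
    for node, sources in sorted(unetpp_inputs(2).items()):
        print(node, "<-", sources)
```

In this layout every sub-network of smaller depth is embedded in the larger one, which is the usual explanation for why UNet++ is less sensitive to the chosen depth than UNet.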
