Adversarial training for data-driven speech enhancement without parallel corpus

T. Higuchi, K. Kinoshita, Marc Delcroix, T. Nakatani
DOI: 10.1109/ASRU.2017.8268914
Published in: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
Citations: 23

Abstract

This paper describes a way of performing data-driven speech enhancement for noise robust automatic speech recognition (ASR), where we train a model for speech enhancement without a parallel corpus. Data-driven speech enhancement with deep models has recently been investigated and proven to be a promising approach for ASR. However, for model training, we need a parallel corpus consisting of noisy speech signals and corresponding clean speech signals for supervision. Therefore a deep model can be trained only with a simulated dataset, and we cannot take advantage of a large number of noisy recordings that do not have corresponding clean speech signals. As a first step towards model training without supervision, this paper proposes a novel approach introducing adversarial training for a time-frequency mask estimator. Our cost function for model training is defined by discriminators instead of by using the distance between the model outputs and the supervision. The discriminators distinguish between true signals and enhanced signals obtained with time-frequency masks estimated with a mask estimator. The mask estimator is trained to cheat the discriminators, which enables the mask estimator to estimate the appropriate time-frequency masks without a parallel corpus. The enhanced signal is finally obtained with masking-based beamforming. Experimental results show that, even without exploiting parallel data, our speech enhancement approach achieves improved ASR performance compared with results obtained with unprocessed signals and achieves comparable ASR performance to that obtained with a model trained with a parallel corpus based on a minimum mean squared error (MMSE) criterion.
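The final enhancement step the abstract mentions, masking-based beamforming, can be sketched numerically: the estimated time-frequency mask weights the multichannel observations to form speech and noise spatial covariance matrices, from which an MVDR beamformer is derived per frequency bin. The sketch below is a minimal illustration under assumed conventions (toy dimensions, a random placeholder mask standing in for the adversarially trained estimator's output, and the steering vector taken as the principal eigenvector of the speech covariance); it is not the paper's exact formulation.

```python
import numpy as np

# Toy dimensions (assumed for illustration): freq bins, frames, microphones
F, T, M = 4, 10, 3
rng = np.random.default_rng(0)

# Multichannel noisy STFT observations Y[f, t, m] (complex-valued)
Y = rng.standard_normal((F, T, M)) + 1j * rng.standard_normal((F, T, M))

# Time-frequency mask in [0, 1]. Random here; in the paper it would come
# from the adversarially trained mask estimator.
mask = rng.uniform(size=(F, T))

def spatial_cov(Y, w):
    """Mask-weighted spatial covariance per frequency:
    sum_t w[f,t] * y_{f,t} y_{f,t}^H / sum_t w[f,t]."""
    num = np.einsum("ft,ftm,ftn->fmn", w, Y, Y.conj())
    return num / w.sum(axis=1)[:, None, None]

Phi_s = spatial_cov(Y, mask)        # speech covariance estimate
Phi_n = spatial_cov(Y, 1.0 - mask)  # noise covariance estimate

# MVDR beamformer per frequency bin: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d),
# with the steering vector d taken as the principal eigenvector of Phi_s.
enhanced = np.empty((F, T), dtype=complex)
for f in range(F):
    _, vecs = np.linalg.eigh(Phi_s[f])      # Hermitian eigendecomposition
    d = vecs[:, -1]                         # principal eigenvector
    w_num = np.linalg.solve(Phi_n[f], d)    # Phi_n^{-1} d
    w = w_num / (d.conj() @ w_num)          # normalize (distortionless constraint)
    enhanced[f] = Y[f] @ w.conj()           # beamformer output per frame

print(enhanced.shape)
```

The single-channel `enhanced` spectrogram would then be inverted back to a waveform; only the mask estimator is trained, so the beamforming stage itself needs no parallel data.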