Bayesian Multichannel Speech Enhancement with a Deep Speech Prior

2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) Pub Date : 2018-11-01 DOI:10.23919/APSIPA.2018.8659591

Kouhei Sekiguchi, Yoshiaki Bando, Kazuyoshi Yoshii, Tatsuya Kawahara

{"title":"Bayesian Multichannel Speech Enhancement with a Deep Speech Prior","authors":"Kouhei Sekiguchi, Yoshiaki Bando, Kazuyoshi Yoshii, Tatsuya Kawahara","doi":"10.23919/APSIPA.2018.8659591","DOIUrl":null,"url":null,"abstract":"This paper describes statistical multichannel speech enhancement based on a deep generative model of speech spectra. Recently, deep neural networks (DNNs) have widely been used for converting noisy speech spectra to clean speech spectra or estimating time-frequency masks. Such a supervised approach, however, requires a sufficient amount of training data (pairs of noisy speech data and clean speech data) and often fails in an unseen noisy environment. This calls for a blind source separation method called multichannel nonnegative matrix factorization (MNMF) that can jointly estimate low-rank source spectra and spatial covariances on the fly. However, the assumption of low-rankness does not hold true for speech spectra. To solve these problems, we propose a semi-supervised method based on an extension of MNMF that consists of a deep generative model for speech spectra and a standard low-rank model for noise spectra. The speech model can be trained in advance with auto-encoding variational Bayes (AEVB) by using only clean speech data and is used as a prior of clean speech spectra for speech enhancement. Given noisy speech spectrogram, we estimate the posterior of clean speech spectra while estimating the noise model on the fly. Such adaptive estimation is achieved by using Gibbs sampling in a unified Bayesian framework. The experimental results showed the potential of the proposed method.","PeriodicalId":287799,"journal":{"name":"2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/APSIPA.2018.8659591","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 32

Abstract

This paper describes statistical multichannel speech enhancement based on a deep generative model of speech spectra. Recently, deep neural networks (DNNs) have widely been used for converting noisy speech spectra to clean speech spectra or estimating time-frequency masks. Such a supervised approach, however, requires a sufficient amount of training data (pairs of noisy speech data and clean speech data) and often fails in an unseen noisy environment. This calls for a blind source separation method called multichannel nonnegative matrix factorization (MNMF) that can jointly estimate low-rank source spectra and spatial covariances on the fly. However, the assumption of low-rankness does not hold true for speech spectra. To solve these problems, we propose a semi-supervised method based on an extension of MNMF that consists of a deep generative model for speech spectra and a standard low-rank model for noise spectra. The speech model can be trained in advance with auto-encoding variational Bayes (AEVB) by using only clean speech data and is used as a prior of clean speech spectra for speech enhancement. Given noisy speech spectrogram, we estimate the posterior of clean speech spectra while estimating the noise model on the fly. Such adaptive estimation is achieved by using Gibbs sampling in a unified Bayesian framework. The experimental results showed the potential of the proposed method.

查看原文本刊更多论文

基于深度语音先验的贝叶斯多通道语音增强

本文描述了基于语音谱深度生成模型的统计多通道语音增强。近年来，深度神经网络(dnn)被广泛应用于将噪声语音频谱转换为干净语音频谱或估计时频掩模。然而，这种有监督的方法需要足够数量的训练数据(有噪声的语音数据对和干净的语音数据对)，并且经常在看不见的有噪声环境中失败。这需要一种称为多通道非负矩阵分解(MNMF)的盲源分离方法，该方法可以实时联合估计低秩源光谱和空间协方差。然而，低秩假设并不适用于语音谱。为了解决这些问题，我们提出了一种基于MNMF扩展的半监督方法，该方法由语音谱的深度生成模型和噪声谱的标准低秩模型组成。该模型可以仅使用干净的语音数据，利用自编码变分贝叶斯算法(AEVB)对语音模型进行预先训练，并作为干净语音谱的先验，用于语音增强。给定噪声语音谱图，我们在动态估计噪声模型的同时估计干净语音谱的后验。这种自适应估计是通过在统一的贝叶斯框架中使用吉布斯采样来实现的。实验结果表明了该方法的潜力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

自引率

0.00%

发文量