Supervised speech separation combined with adaptive beamforming

Pub Date: 2022-11-01 DOI: 10.1016/j.csl.2022.101409

Zoran Šarić, Miško Subotić, Ružica Bilibajkić, Marko Barjaktarović, Jasmina Stojanović

Volume 76, Article 101409. Publication type: Journal Article. URL: https://www.sciencedirect.com/science/article/pii/S0885230822000444

Citations: 2

Abstract

Microphone arrays are a powerful tool for ambient noise suppression. A multi-channel minimum mean square error (MMSE) solution can be factorized into a minimum variance distortionless response (MVDR) beamformer followed by a single-channel Wiener post-filter. The MVDR beamformer, like its equivalent form, the generalized sidelobe canceller (GSC), often does not provide sufficient noise reduction because of its limited ability to reduce diffuse noise and reverberation. Steering and calibration errors also degrade the performance of both MVDR and GSC beamformers. The post-filter can be realized by any single-channel noise reduction method. A modern and promising approach to single-channel noise reduction is supervised speech separation (SSS), in which a supervised learning algorithm, typically a deep neural network (DNN), is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. In this paper, we combine SSS with adaptive beamforming. Adaptive beamforming is realized by a simplified GSC (S-GSC), whose equivalence with the MVDR beamformer is also proved in the paper. In the proposed S-GSC beamformer, the conventional beamformer is replaced by the central microphone signal, and steering towards the target speaker requires no direction of arrival (DOA) estimation. The trained DNN of the SSS module estimates the ideal ratio mask (IRM), which is used to adapt the blocking matrix, calibrate the microphones, adapt the adaptive noise canceller, and perform the post-filtering. The proposed method was tested on 720 utterances from the TIMIT database used as target speech. The reverberant room was simulated with acoustic impulse responses recorded in a real room. Performance was evaluated with the PESQ, STOI, and SDR measures. The test results showed that the proposed combined method outperforms the individual SSS and S-GSC methods.
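As a rough illustration of two building blocks named in the abstract, the sketch below computes MVDR weights for a single frequency bin and an ideal ratio mask from known speech and noise powers. It is not the authors' S-GSC implementation. The function names, the steering vector, the random noise covariance, and the exponent `beta = 0.5` (a commonly used IRM definition) are all assumptions made for this example.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Textbook MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

def ideal_ratio_mask(speech_power, noise_power, beta=0.5, eps=1e-12):
    """Common IRM definition (assumed here): (S / (S + N)) ** beta per time-frequency cell."""
    return (speech_power / (speech_power + noise_power + eps)) ** beta

rng = np.random.default_rng(0)
n_mics = 4

# Hypothetical steering vector and a random Hermitian positive-definite noise covariance.
steering = np.exp(-1j * np.pi * 0.3 * np.arange(n_mics))
a = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
noise_cov = a @ a.conj().T + n_mics * np.eye(n_mics)

w = mvdr_weights(noise_cov, steering)
gain_toward_target = w.conj() @ steering  # distortionless constraint: w^H d = 1

# Toy powers for three time-frequency cells: speech only, equal mix, noise only.
speech_power = np.array([4.0, 1.0, 0.0])
noise_power = np.array([0.0, 1.0, 4.0])
mask = ideal_ratio_mask(speech_power, noise_power)
```

In this toy setup the mask approaches 1 where speech dominates and 0 where noise dominates. In the paper the mask is estimated by a trained DNN rather than computed from known powers, and it is reused for blocking-matrix adaptation, microphone calibration, and noise-canceller adaptation as well as for post-filtering.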


