Att-TasNet: Attending to Encodings in Time-Domain Audio Speech Separation of Noisy, Reverberant Speech Mixtures

Impact factor 1.3, JCR Q3 (Engineering, Electrical & Electronic)

Frontiers in Signal Processing. Published: 2022-05-11. DOI: 10.3389/frsip.2022.856968

W. Ravenscroft, Stefan Goetze, Thomas Hain


Citations: 7

Abstract

Separation of speech mixtures in noisy and reverberant environments remains a challenging task for state-of-the-art speech separation systems. Time-domain audio speech separation networks (TasNets) are among the most commonly used network architectures for this task. TasNet models have demonstrated strong performance on typical speech separation baselines where speech is not contaminated with noise. When additive or convolutive noise is present, speech separation performance degrades significantly. TasNets are typically composed of an encoder network, a mask estimation network and a decoder network. When these networks are used without any pre-processing of the input data or post-processing of the separation network output, their design places most of the onus for enhancing the signal on the mask estimation network. This work proposes multihead attention (MHA) as an additional layer in the encoder and decoder to help the separation network attend to encoded features that are relevant to the target speakers and, conversely, suppress noisy disturbances in the encoded features. As shown in this work, incorporating MHA mechanisms into the encoder network in particular leads to a consistent performance improvement across numerous quality and intelligibility metrics in a variety of acoustic conditions using the WHAMR corpus, a dataset of noisy reverberant speech mixtures. The use of MHA is also investigated in the decoder network, where smaller performance improvements are consistently gained within specific model configurations. The best performing MHA models yield a mean 0.6 dB scale-invariant signal-to-distortion ratio (SISDR) improvement on noisy reverberant mixtures over a baseline 1D convolution encoder. A mean 1 dB SISDR improvement is observed on clean speech mixtures.
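The SISDR figures quoted in the abstract are differences between scores on a decibel scale. As a rough illustration of how such a score can be computed, the sketch below uses the commonly used definition of scale-invariant signal-to-distortion ratio: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name `si_sdr`, the mean removal step, the `eps` guard, and the toy signals are assumptions made for this example and are not taken from the paper, whose evaluation code may differ.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Illustrative helper (hypothetical name), not the paper's evaluation code.
    """
    if len(estimate) != len(reference) or len(reference) == 0:
        raise ValueError("estimate and reference must be non-empty and equal in length")

    n = len(reference)
    # Remove the mean of each signal (a common convention, assumed here).
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Optimal scaling of the reference: alpha = <est, ref> / ||ref||^2.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)

    # Split the estimate into a scaled-target part and a distortion part.
    target = [alpha * r for r in ref]
    distortion = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))


# Toy signals (made up for illustration): a reference and an estimate that
# equals the reference plus a small component orthogonal to it.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]

score = si_sdr(estimate, reference)  # about 20 dB for these toy signals
rescaled_score = si_sdr([2.0 * x for x in estimate], reference)  # same score
```

Because the reference is rescaled to best match the estimate before the ratio is taken, multiplying the estimate by a constant leaves the score unchanged, which is what "scale invariant" refers to. An improvement such as the 0.6 dB quoted above corresponds to subtracting one such score from another.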



