Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training

Morten Kolbæk, Dong Yu, Zheng-Hua Tan, Jesper Jensen
{"title":"Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training","authors":"Morten Kolbæk, Dong Yu, Z. Tan, J. Jensen","doi":"10.1109/MLSP.2017.8168152","DOIUrl":null,"url":null,"abstract":"In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for speaker independent multi-talker speech separation and denoising, simultaneously. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT, for single-channel speaker independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. We focus our experiments on generalizability and noise robustness of models that rely on various types of a priori knowledge e.g. in terms of noise type and number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure, on the speaker independent multi-talker speech separation and denoising task, for various noise types and Signal-to-Noise Ratios (SNRs). Specifically, we first show that LSTM RNNs can achieve large SDR and ESTOI improvements, when evaluated using known noise types, and that a single model is capable of handling multiple noise types with only a slight decrease in performance. Furthermore, we show that a single LSTM RNN can handle both two-speaker and three-speaker noisy mixtures, without a priori knowledge about the exact number of speakers. Finally, we show that LSTM RNNs trained using uPIT generalize well to noise types not seen during training.","PeriodicalId":6542,"journal":{"name":"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)","volume":"89 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MLSP.2017.8168152","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

Abstract

In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for simultaneous speaker-independent multi-talker speech separation and denoising. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT for single-channel, speaker-independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. Our experiments focus on the generalizability and noise robustness of models that rely on various types of a priori knowledge, e.g., the noise type and the number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure on the speaker-independent multi-talker speech separation and denoising task, for various noise types and Signal-to-Noise Ratios (SNRs). Specifically, we first show that LSTM RNNs can achieve large SDR and ESTOI improvements when evaluated on known noise types, and that a single model can handle multiple noise types with only a slight decrease in performance. Furthermore, we show that a single LSTM RNN can handle both two-speaker and three-speaker noisy mixtures without a priori knowledge of the exact number of speakers. Finally, we show that LSTM RNNs trained using uPIT generalize well to noise types not seen during training.
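The key idea the abstract relies on is that the assignment of network output streams to reference speakers is resolved once per utterance: the training loss is the minimum, over all permutations of that assignment, of the total utterance-level error. The following is a minimal NumPy sketch of such an utterance-level permutation-invariant MSE; the function name, array shapes, and plain spectrogram-style targets are illustrative assumptions, not taken from the authors' implementation (which trains deep bi-directional LSTM mask estimators on spectral features).

```python
import itertools
import numpy as np

def upit_mse_loss(estimates, targets):
    """Utterance-level permutation-invariant MSE (a sketch).

    estimates, targets: arrays of shape (S, T, F) --
    S speaker streams, T time frames, F frequency bins.
    Returns the minimum mean-squared error over all assignments
    of estimated streams to target speakers, with the permutation
    chosen once for the whole utterance.
    """
    S = estimates.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(S)):
        # Total per-speaker MSE under this stream-to-speaker assignment.
        err = sum(np.mean((estimates[i] - targets[p]) ** 2)
                  for i, p in enumerate(perm))
        best = min(best, err)
    return best / S

# Illustrative usage: two estimated streams vs. two references,
# 100 frames x 129 frequency bins each (arbitrary dummy data).
est = np.random.rand(2, 100, 129)
ref = np.random.rand(2, 100, 129)
print(upit_mse_loss(est, ref))
```

Because the permutation is fixed for the whole utterance rather than chosen per frame, training encourages the network to keep each speaker on the same output stream over time, which is what allows the separator to be used without any speaker tracking or prior speaker knowledge at test time.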