Scale-aware dual-branch complex convolutional recurrent network for monaural speech enhancement
Yihao Li, Meng Sun, Xiongwei Zhang, Hugo Van hamme
Computer Speech and Language, vol. 86, Article 101618, 2024. DOI: 10.1016/j.csl.2024.101618
Abstract
A key step in single-channel speech enhancement is the orthogonal separation of speech and noise. In this paper, a dual-branch complex convolutional recurrent network (DBCCRN) is proposed to separate the complex spectrograms of speech and noise simultaneously. To model both local and global information, we incorporate conformer modules into the network. The orthogonality of the outputs of the two branches can be improved by optimizing Signal-to-Noise Ratio (SNR)-related losses. However, we found that models trained with two existing versions of the SI-SNR loss yield enhanced speech at a very different scale from that of its clean counterpart, and the SNR loss shrinks the amplitude of the enhanced speech as well. A simple remedy is to normalize the output, but this works only for offline processing, not for streaming; when streaming speech enhancement is required, the scale error degrades speech quality. From an analytical inspection of the weaknesses of models trained with the SNR and SI-SNR losses, a new loss function called scale-aware SNR (SA-SNR) is proposed to cope with the scale variations of the enhanced speech. SA-SNR improves over SI-SNR by introducing an extra regularization term that encourages the model to produce signals at a scale similar to that of the input, and this term has little influence on the perceptual quality of the enhanced speech. In addition, the commonly used evaluation recipe for speech enhancement may not comprehensively reflect the performance of methods trained with SI-SNR losses, since amplitude variations of the input speech should be carefully considered; a new evaluation measure called ScaleError is therefore introduced. Experiments show that the proposed method improves over existing baselines on the evaluation sets of the Voice Bank corpus, DEMAND, and the Interspeech 2020 Deep Noise Suppression Challenge, obtaining higher PESQ, STOI, SSNR, CSIG, CBAK, and COVL scores.
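To make the scale issue concrete, the sketch below (Python/NumPy) implements the standard SI-SNR definition alongside plausible forms of the SA-SNR loss and the ScaleError measure. The abstract does not give the exact formulas, so the scale-penalty term, the weight lam, and the function names sa_snr and scale_error are illustrative assumptions, not the paper's definitions.

    # Minimal sketch of the SNR-family losses discussed in the abstract.
    # si_snr follows the standard definition; sa_snr and scale_error are
    # hedged guesses (the paper's exact formulas are not in the abstract).
    import numpy as np

    def si_snr(est, ref, eps=1e-8):
        """Scale-invariant SNR in dB (standard definition)."""
        est = est - est.mean()
        ref = ref - ref.mean()
        # Project the estimate onto the reference to get the target component.
        proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
        noise = est - proj
        return 10 * np.log10((np.dot(proj, proj) + eps) / (np.dot(noise, noise) + eps))

    def sa_snr(est, ref, lam=0.1, eps=1e-8):
        """Hypothetical scale-aware SNR: SI-SNR minus a scale penalty.

        The penalty (an assumption) discourages estimates whose energy
        deviates from the reference energy; lam (assumed value) balances
        the two terms. Since SI-SNR itself ignores scale, only the penalty
        reacts to amplitude errors.
        """
        scale_penalty = np.abs(
            np.log10((np.dot(est, est) + eps) / (np.dot(ref, ref) + eps)))
        return si_snr(est, ref, eps) - lam * scale_penalty

    def scale_error(est, ref, eps=1e-8):
        """Hypothetical ScaleError: absolute log-RMS ratio in dB."""
        rms_est = np.sqrt(np.mean(est ** 2) + eps)
        rms_ref = np.sqrt(np.mean(ref ** 2) + eps)
        return np.abs(20 * np.log10(rms_est / rms_ref))

Under these assumptions, an estimate that matches the clean signal up to a gain of 0.5 would score the same SI-SNR as the clean signal itself but would incur a nonzero scale penalty and a ScaleError of about 6 dB, which is the kind of amplitude mismatch the abstract argues standard metrics overlook.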
About the Journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.