MFFR-net: Multi-scale feature fusion and attentive recalibration network for deep neural speech enhancement

Impact factor: 2.9 | Zone 3 (Engineering & Technology) | JCR Q2 (Engineering, Electrical & Electronic)
Nasir Saleem, Sami Bourouis
Journal: Digital Signal Processing, vol. 156, Article 104870
Publication date: 2024-11-14
Publication type: Journal Article
DOI: 10.1016/j.dsp.2024.104870
URL: https://www.sciencedirect.com/science/article/pii/S1051200424004949
Citations: 0

Abstract

Deep neural networks (DNNs) have advanced speech enhancement (SE), particularly against nonstationary background noise. In this context, multi-scale feature fusion and recalibration (MFFR) can improve SE performance by combining multi-scale and recalibrated features. This paper proposes an SE system that fuses features from a large-scale pre-trained model with features that are attentively recalibrated using convolutional layers of varying kernel sizes. This lets the system capture features across diverse scales and thereby improves its overall performance. The proposed system uses a transferable feature extractor architecture and integrates it with the multi-scale, attentively recalibrated features. A convolutional encoder-decoder built from 2D convolutional layers extracts both local and contextual features from speech signals. To capture long-term temporal dependencies, a bidirectional simple recurrent unit (BSRU) serves as the bottleneck layer between the encoder and decoder. Experiments are conducted on three publicly available datasets: Texas Instruments/Massachusetts Institute of Technology (TIMIT), LibriSpeech, and Voice Cloning Toolkit + Diverse Environments Multi-channel Acoustic Noise Database (VCTK+DEMAND). The results show that the proposed system outperforms several recent approaches on the Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) metrics. On TIMIT, it improves STOI by 17.3% and PESQ by 0.74 over the noisy mixture. On LibriSpeech, it improves STOI by 17.6% and PESQ by 0.87.
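The abstract does not spell out how the multi-scale features are recalibrated, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a squeeze-and-excitation-style gate (global average followed by a sigmoid) as a stand-in for the attentive recalibration, and fixed averaging kernels as stand-ins for learned convolution weights. The function names `conv1d_same` and `multi_scale_recalibrate`, the kernel sizes (3, 5, 7), and the 1D setting are all hypothetical choices made for illustration; the paper itself uses 2D convolutional layers.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 1D correlation that keeps the input length (odd kernels)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multi_scale_recalibrate(signal, kernel_sizes=(3, 5, 7)):
    """Fuse branches with different receptive fields, gated per branch.

    Assumption: averaging kernels replace learned weights, and a plain
    sigmoid of the branch mean replaces a learned attention module.
    """
    # One branch per kernel size: larger kernels see a wider context.
    branches = [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]
    # "Squeeze": summarize each branch by its global average.
    squeezed = [sum(b) / len(b) for b in branches]
    # "Excite": turn each summary into a gate in (0, 1).
    gates = [sigmoid(s) for s in squeezed]
    # Recalibrate each branch by its gate, then fuse by summation.
    fused = [
        sum(g * b[i] for g, b in zip(gates, branches))
        for i in range(len(signal))
    ]
    return fused, gates


if __name__ == "__main__":
    frame = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    fused, gates = multi_scale_recalibrate(frame)
    print(len(fused), [round(g, 3) for g in gates])
```

In a trained network the gates would come from learned layers and would differ per channel and per input, which is what lets the model emphasize whichever scale is most informative for a given noisy frame.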
Source journal
Digital Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 5.30
Self-citation rate: 17.20%
Articles published: 435
Review time: 66 days
Journal description: Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal. The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology,
• financial time series analysis,
• autonomous vehicles,
• quantum computing,
• neuromorphic engineering,
• human-computer interaction and intelligent user interfaces,
• environmental signal processing,
• geophysical signal processing including seismic signal processing,
• chemioinformatics and bioinformatics,
• audio, visual and performance arts,
• disaster management and prevention,
• renewable energy,