FSformer: Sparsely and effectively learning key features for multi-channel speech enhancement

IF 3.4, Zone 2 (Physics and Astrophysics), JCR Q1 ACOUSTICS

Applied Acoustics Pub Date : 2025-06-04 DOI:10.1016/j.apacoust.2025.110858

Shiyun Xu, Wenjie Zhang, Yinghan Cao, Zehua Zhang, Changjun He, Mingjiang Wang

Applied Acoustics, vol. 240, Article 110858. Journal article, not open access. https://www.sciencedirect.com/science/article/pii/S0003682X25003305

Citations: 0

Abstract

Noise and reverberation can significantly degrade the quality and intelligibility of speech. Therefore, multi-channel speech enhancement models that effectively leverage spatial information have garnered widespread attention. The Transformer architecture has demonstrated impressive performance in multi-channel speech enhancement. However, the redundant features extracted by the self-attention mechanism hinder the network's ability to capture local characteristics, resulting in the loss of speech details. To address the aforementioned issues, we propose the fused sparse transformer (FSformer) to assist the network in learning key features sparsely and effectively. We introduce the fused sparse self-attention (FSSA) module, which selects only the top-k features with the highest contribution scores when computing the self-attention map and employs a fusion strategy to adaptively retain the most valuable features. Furthermore, the local feature refinement extractor (L-FRE) and global feature refinement extractor (G-FRE) are introduced in FSSA to enhance the interaction between global and local features. Additionally, we propose the partial gated feed-forward network (GPFN), which utilizes partial convolution to further enhance the feature extraction capability of the network and employs the gating mechanism to reduce redundancy within channels, thereby compensating for the shortcomings of FSSA. The experimental results indicate that FSformer demonstrates a significant advantage in terms of speech enhancement performance, effectively and naturally improving speech quality and intelligibility, thereby providing a pleasant experience for listeners. Specifically, on the spatialized DNS dataset, FSformer achieves PESQ, STOI, and SI-SDR scores of 3.40, 0.952, and 10.9, respectively. FSformer also demonstrates exceptional performance in suppressing noise and reverberation across various levels of noise and reverberation environments. 
In the test set containing noise and reverberation, FSformer achieves a PESQ score of 3.41, a STOI score of 0.959, an SI-SDR score of 10.9, a DNSMOS score of 3.525, a CD of 2.527, an LLR of 0.27, and an SNR$_{fw}$ of 13.434. Furthermore, FSformer demonstrates superior generalization capabilities, achieving a DNSMOS of 3.163, a MOS$_{P.808}$ of 3.762, and an NISQA of 3.779 on real datasets.
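
The abstract says FSSA keeps only the top-k highest-scoring features when computing the self-attention map, but it does not give the exact formulation. The following is a minimal, dependency-free sketch of generic top-k sparse attention under that reading, not the authors' implementation. The function names, the per-query selection rule, and the toy inputs are all assumptions made for illustration, and the fusion strategy, L-FRE, and G-FRE are left out because the abstract does not describe them in enough detail.

```python
import math


def topk_attention_weights(q, k, top_k):
    """Return one attention row per query, with only top_k nonzero weights.

    q and k are lists of equal-length float vectors (one vector per token).
    Illustrative only: the per-query top-k rule is an assumption.
    """
    d = len(q[0])
    rows = []
    for qi in q:
        # Scaled dot-product score between this query and every key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Indices of the top_k largest scores; all other positions are masked.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        keep = order[:top_k]
        # Softmax over the kept scores only (max subtracted for stability).
        m = max(scores[j] for j in keep)
        exp = {j: math.exp(scores[j] - m) for j in keep}
        z = sum(exp.values())
        rows.append([exp[j] / z if j in exp else 0.0 for j in range(len(scores))])
    return rows


def topk_sparse_attention(q, k, v, top_k):
    """Apply the sparse attention weights to the value vectors."""
    weights = topk_attention_weights(q, k, top_k)
    dim = len(v[0])
    return [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(dim)]
        for row in weights
    ]


if __name__ == "__main__":
    # Toy example: 2 queries attending over 3 keys, keeping 2 keys each.
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(topk_attention_weights(q, k, top_k=2))
    print(topk_sparse_attention(q, k, v, top_k=2))
```

Setting `top_k` to the number of keys recovers ordinary dense softmax attention, which makes it easy to compare the sparse and dense variants on the same inputs.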

Source journal

Applied Acoustics (Physics, Acoustics)

CiteScore : 7.40
Self-citation rate : 11.80%
Publication volume : 618
Review time : 7.5 months

About the journal: Since its launch in 1968, Applied Acoustics has been publishing high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense. Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience in the following ways: • Complete Papers • Short Technical Notes • Review Articles; and thereby provides a wealth of technological information that can be used to solve related problems. Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.