FSformer: Sparsely and effectively learning key features for multi-channel speech enhancement
Authors: Shiyun Xu, Wenjie Zhang, Yinghan Cao, Zehua Zhang, Changjun He, Mingjiang Wang
Journal: Applied Acoustics, volume 240, Article 110858 (impact factor 3.4; JCR Q1, Acoustics)
Published: 2025-06-04
DOI: 10.1016/j.apacoust.2025.110858
URL: https://www.sciencedirect.com/science/article/pii/S0003682X25003305
Citations: 0
Abstract
Noise and reverberation can significantly degrade the quality and intelligibility of speech, so multi-channel speech enhancement models that exploit spatial information have attracted widespread attention. The Transformer architecture has demonstrated impressive performance in multi-channel speech enhancement. However, the redundant features extracted by the self-attention mechanism hinder the network's ability to capture local characteristics, resulting in the loss of speech details. To address this issue, we propose the fused sparse transformer (FSformer), which helps the network learn key features sparsely and effectively. We introduce the fused sparse self-attention (FSSA) module, which selects only the top-k features with the highest contribution scores when computing the self-attention map and employs a fusion strategy to adaptively retain the most valuable features. Furthermore, the local feature refinement extractor (L-FRE) and global feature refinement extractor (G-FRE) are introduced in FSSA to enhance the interaction between global and local features. Additionally, we propose the partial gated feed-forward network (GPFN), which uses partial convolution to further strengthen the network's feature extraction and a gating mechanism to reduce redundancy within channels, thereby compensating for the shortcomings of FSSA. Experimental results indicate that FSformer offers a significant advantage in speech enhancement performance, improving speech quality and intelligibility effectively and naturally and yielding a more pleasant listening experience. Specifically, on the spatialized DNS dataset, FSformer achieves PESQ, STOI, and SI-SDR scores of 3.40, 0.952, and 10.9, respectively. FSformer also suppresses noise and reverberation exceptionally well across environments with various levels of noise and reverberation.
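To make the top-k selection in FSSA concrete, the following is a minimal Python sketch, not the authors' implementation. It keeps only the k largest attention scores per query before the softmax, so the remaining positions receive zero weight. The abstract does not specify the fusion strategy, so `fused_attention` is a hypothetical stand-in that mixes two top-k branches with fixed weights (`ks` and `mix` are assumed parameters; the paper's fusion is adaptive). All function names here are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_scores(queries, keys):
    """Scaled dot-product scores, one row of scores per query."""
    d = len(queries[0])
    return [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
            for q in queries]


def topk_mask(row, k):
    """Keep the k largest scores in a row and set the rest to -inf."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    keep = set(order[:k])
    return [x if j in keep else -math.inf for j, x in enumerate(row)]


def attend(weights, values):
    """Weighted sum of value vectors for each query."""
    cols = len(values[0])
    return [[sum(w * v[c] for w, v in zip(row, values)) for c in range(cols)]
            for row in weights]


def topk_sparse_attention(queries, keys, values, k):
    """Self-attention restricted to the top-k scores of each query."""
    scores = scaled_scores(queries, keys)
    weights = [softmax(topk_mask(row, k)) for row in scores]
    return attend(weights, values)


def fused_attention(queries, keys, values, ks=(1, 2), mix=(0.5, 0.5)):
    """Hypothetical fusion: fixed-weight mix of branches with different k.

    This is an assumption for illustration only; the paper describes an
    adaptive fusion whose details are not given in the abstract.
    """
    branches = [topk_sparse_attention(queries, keys, values, k) for k in ks]
    rows, cols = len(branches[0]), len(branches[0][0])
    return [[sum(w * b[i][c] for w, b in zip(mix, branches))
             for c in range(cols)] for i in range(rows)]


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    print(topk_sparse_attention(q, keys, values, k=1))
    print(fused_attention(q, keys, values))
```

With k equal to the number of keys, the sketch reduces to ordinary dense attention; a smaller k discards low-scoring positions, which is the redundancy reduction the abstract refers to.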
On the test set containing noise and reverberation, FSformer achieves a PESQ score of 3.41, a STOI score of 0.959, an SI-SDR score of 10.9, a DNSMOS score of 3.525, a CD of 2.527, an LLR of 0.27, and an SNR_fw of 13.434. Furthermore, FSformer demonstrates superior generalization capabilities, achieving a DNSMOS of 3.163, a MOS_P.808 of 3.762, and an NISQA of 3.779 on real datasets.
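For readers unfamiliar with SI-SDR, one of the metrics reported above, the sketch below computes it under the commonly used definition: the estimate is projected onto the reference, and the ratio of the projected energy to the residual energy is expressed in dB. The abstract does not describe the authors' evaluation code, so the mean removal and the function name `si_sdr` are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.

    Assumes both signals have the same length and that the estimate is
    not an exact scaled copy of the reference (the residual is nonzero).
    """
    n = len(reference)
    # Remove the mean of each signal (a common convention, assumed here).
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]

    # Project the estimate onto the reference to obtain the target part.
    scale = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [scale * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


if __name__ == "__main__":
    ref = [math.sin(0.1 * i) for i in range(200)]
    noise = [0.1 * math.cos(0.37 * i) for i in range(200)]
    est = [r + v for r, v in zip(ref, noise)]
    print(round(si_sdr(est, ref), 2))
    # Rescaling the estimate leaves the score unchanged.
    print(round(si_sdr([3.0 * x for x in est], ref), 2))
```

Because the projection absorbs any gain applied to the estimate, the score is unaffected by output level, which is why the metric is preferred over plain SDR when comparing enhancement systems.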
About the journal:
Since its launch in 1968, Applied Acoustics has been publishing high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data.
Applied Acoustics encourages the exchange of practical experience in the following ways:
• Complete Papers
• Short Technical Notes
• Review Articles
It thereby provides a wealth of technological information that can be used to solve related problems.
Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.