Tong Lei , Qinwen Hu , Ziyao Lin , Andong Li , Rilin Chen , Meng Yu , Dong Yu , Jing Lu
Title: FNSE-SBGAN: Far-field speech enhancement with Schrödinger bridge and generative adversarial networks
Journal: Applied Acoustics, vol. 242, Article 111050 (published 2025-09-11)
DOI: 10.1016/j.apacoust.2025.111050
URL: https://www.sciencedirect.com/science/article/pii/S0003682X25005225
Journal impact factor: 3.4; JCR: Q1 (Acoustics)
Citations: 0
Abstract
The prevailing strategy for neural speech enhancement is purely supervised deep learning on simulated pairs of far-field noisy-reverberant speech (i.e., mixtures) and clean speech. However, models trained this way often generalize poorly to real recorded mixtures, primarily because of the inherent discrepancies between simulated and real-world acoustic environments. To address this issue, this study investigates training enhancement models directly on real mixtures. Specifically, we revisit the single-channel far-field to near-field speech enhancement (FNSE) task, focusing on real-world data characterized by low signal-to-noise ratio (SNR), high reverberation, and mid-to-high-frequency attenuation. We propose FNSE-SBGAN, which combines a Schrödinger Bridge (SB)-based diffusion model with generative adversarial networks (GANs) and is designed to enhance real-world far-field speech signals directly. Experimental results show that FNSE-SBGAN performs strongly across multiple metrics: it reduces the character error rate (CER) by up to 14.58% compared to far-field signals while preserving superior subjective quality, establishing a new benchmark for real-world far-field speech enhancement. Additionally, we introduce an evaluation framework based on matrix rank analysis in the time-frequency domain, which provides systematic insights into model performance and reveals the strengths and weaknesses of different generative methods.
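For readers unfamiliar with the CER metric cited in the abstract, the following is a minimal sketch of how a character error rate is conventionally computed: the Levenshtein (edit) distance between a reference transcript and a recognizer's output, divided by the reference length. The function names `edit_distance` and `cer` are illustrative and are not taken from the paper. The abstract does not describe the ASR system or scoring pipeline, nor whether the 14.58% figure is an absolute or relative reduction, so this shows only the standard definition.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a lower CER on the enhanced signal than on the raw far-field signal means the recognizer made fewer character-level errors after enhancement.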
About the journal:
Since its launch in 1968, Applied Acoustics has been publishing high quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience in the following ways: • Complete Papers • Short Technical Notes • Review Articles; and thereby provides a wealth of technological information that can be used to solve related problems.
Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.