FNSE-SBGAN: Far-field speech enhancement with Schrödinger bridge and generative adversarial networks

Impact factor 3.4; Zone 2, Physics and Astrophysics; JCR Q1, Acoustics

Applied Acoustics, Pub Date: 2025-09-11, DOI: 10.1016/j.apacoust.2025.111050

Tong Lei, Qinwen Hu, Ziyao Lin, Andong Li, Rilin Chen, Meng Yu, Dong Yu, Jing Lu

Volume 242, Article 111050. Full text: https://www.sciencedirect.com/science/article/pii/S0003682X25005225

Citations: 0

Abstract

The prevailing strategy for neural speech enhancement employs purely-supervised deep learning with simulated pairs of far-field noisy-reverberant speech (i.e., mixtures) and clean speech. However, trained models frequently reveal restricted generalizability to real-recorded mixtures. This limitation is primarily due to the inherent discrepancies between simulated and real-world acoustic environments. To address this issue, this study investigates training enhancement models directly on real mixtures. Specifically, we revisit the single-channel far-field to near-field speech enhancement (FNSE) task, focusing on real-world data characterized by low signal-to-noise ratio (SNR), high reverberation, and mid-to-high-frequency attenuation. We propose FNSE-SBGAN, which presents a unique solution by merging a Schrödinger Bridge (SB)-based diffusion model with generative adversarial networks (GANs). The framework is specifically crafted to directly enhance real-world far-field speech signals. Our experimental results demonstrate FNSE-SBGAN's exceptional performance across multiple metrics. It notably decreases the character error rate (CER) by up to 14.58% compared to far-field signals while preserving superior subjective quality, establishing a new benchmark for real-world far-field speech enhancement. Additionally, we introduce an evaluation framework leveraging matrix rank analysis in the time-frequency domain, providing systematic insights into model performance and revealing the strengths and weaknesses of different generative methods.
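
The abstract reports its recognition result as a character error rate (CER). As a rough illustration only, the sketch below assumes the common definition of CER: the character-level Levenshtein distance between a reference transcript and a recognizer's hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are hypothetical, and this is not the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substituted character out of four.
    print(cer("abcd", "abxd"))
```

Under this definition, a lower CER on the enhanced signal than on the raw far-field signal would indicate that enhancement helped the recognizer. The abstract does not say whether its 14.58% figure is an absolute or a relative reduction, so the sketch makes no assumption about that.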


Source journal

Applied Acoustics (Physics: Acoustics)

CiteScore: 7.40
Self-citation rate: 11.80%
Articles published: 618
Review time: 7.5 months

About the journal: Since its launch in 1968, Applied Acoustics has been publishing high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense. Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The journal aims to encourage the exchange of practical experience through publication, and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics publishes complete papers, short technical notes, and review articles. Manuscripts that address all fields of applications of acoustics, ranging from medicine and NDT to the environment and buildings, are welcome.