SB-SENet: Diffusion model based on Schrödinger bridge for speech enhancement

IF 3.4 · Zone 2, Physics and Astrophysics · Q1 ACOUSTICS

Applied Acoustics · Pub Date: 2025-04-25 · DOI: 10.1016/j.apacoust.2025.110742

Huaifeng Zhang, Guigeng Li, Pengfei Wu, Yong Gao, Hao Zhang

Applied Acoustics, Volume 236, Article 110742. Full text: https://www.sciencedirect.com/science/article/pii/S0003682X25002142

Citations: 0

Abstract

Score-based generative models and diffusion models are increasingly applied to speech enhancement, where they demonstrate remarkable performance. However, mixed speech samples, which combine speech and Gaussian noise, lack accurate structural information; this still makes inference difficult and degrades speech quality. This paper introduces SB-SENet, a novel generative model for speech enhancement based on the Schrödinger bridge. The Schrödinger bridge constructs the optimal transport path from the initial probability distribution to the target probability distribution by minimizing a Kullback-Leibler divergence cost function. This process is part of the entropy-regularized optimal path solution and aims to move the noisy speech sample toward the clean speech sample through probability distributions, yielding the predicted sample. Unlike diffusion models, which first learn the forward diffusion process from the noisy speech sample to a Gaussian distribution, SB-SENet directly learns the nonlinear diffusion process from the noisy speech sample to the clean speech sample, preserving more structural information about the initial sample. The SB-SENet model uses a Transformer to capture the characteristic features of time-series signals and a U-Net to fuse multi-scale information. The loss function is built within a score-based generative framework and combines phase loss, magnitude loss, and metric loss to gradually reduce the difference between the predicted sample and the clean speech sample. Experimental results show that, in terms of speech quality, SB-SENet achieves a PESQ score of 3.79 on the Voicebank+DEMAND dataset and 3.65 on the DNS Challenge dataset, which is state-of-the-art performance compared with recent speech enhancement models. Demo samples are available at https://svcodec.github.io/SB-SENet.github.io/.
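To make the idea of a process that runs directly from the noisy sample to the clean sample more concrete, the following is a minimal sketch using a Brownian bridge, a simple stochastic process pinned at both endpoints. It is an illustration only and is not taken from the paper: the function name `brownian_bridge_sample`, the noise scale `sigma`, and the toy sine-wave signals are all assumptions, and the bridge actually learned by SB-SENet is described as nonlinear and may differ substantially.

```python
import numpy as np


def brownian_bridge_sample(noisy, clean, t, sigma=0.5, rng=None):
    """Draw x_t from a Brownian bridge pinned at `noisy` (t=0) and `clean` (t=1).

    Hypothetical helper for illustration: the mean interpolates linearly
    between the two endpoints, and the variance sigma^2 * t * (1 - t)
    vanishes at both ends, so no endpoint is ever replaced by pure noise.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.asarray(noisy, dtype=float)
    clean = np.asarray(clean, dtype=float)
    mean = (1.0 - t) * noisy + t * clean
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(noisy.shape)


# Toy signals standing in for a clean utterance and its noisy mixture.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 1600))
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

x_start = brownian_bridge_sample(noisy, clean, t=0.0, rng=rng)  # equals noisy
x_mid = brownian_bridge_sample(noisy, clean, t=0.5, rng=rng)    # in between
x_end = brownian_bridge_sample(noisy, clean, t=1.0, rng=rng)    # equals clean
```

In this sketch the intermediate states always stay anchored to the noisy input, which is one way to read the abstract's claim that a bridge preserves more structural information than a process ending in a Gaussian distribution.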


Source journal

Applied Acoustics (Physics: Acoustics)

CiteScore: 7.40
Self-citation rate: 11.80%
Articles published: 618
Review time: 7.5 months

Journal description: Since its launch in 1968, Applied Acoustics has published high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense. Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics accepts complete papers, short technical notes, and review articles. Manuscripts that address all fields of applications of acoustics, ranging from medicine and NDT to the environment and buildings, are welcome.