Title: SB-SENet: Diffusion model based on Schrödinger bridge for speech enhancement
Authors: Huaifeng Zhang, Guigeng Li, Pengfei Wu, Yong Gao, Hao Zhang
Journal: Applied Acoustics, Volume 236, Article 110742
Publication date: 2025-04-25
DOI: 10.1016/j.apacoust.2025.110742
URL: https://www.sciencedirect.com/science/article/pii/S0003682X25002142
Publication type: Journal Article
Journal metrics: Impact Factor 3.4; JCR Q1 (Acoustics)
Citation count: 0
Abstract
Score-based generative models and diffusion models are increasingly applied to speech enhancement, where they demonstrate remarkable performance. However, the lack of accurate structural information in mixed speech samples, which combine speech and Gaussian noise, still poses inference challenges and thereby degrades speech quality. This paper introduces SB-SENet, a novel generative model for speech enhancement based on the Schrödinger bridge. The Schrödinger bridge constructs the optimal transport path from the initial probability distribution to the target probability distribution by minimizing a Kullback-Leibler divergence cost function. This process is part of the entropy-regularized optimal path solution and aims to transport the noisy speech sample toward the clean speech sample at the level of probability distributions, yielding the predicted sample. Unlike diffusion models, which first learn the forward diffusion process from the noisy speech sample to a Gaussian distribution, SB-SENet directly learns the nonlinear diffusion process from the noisy speech sample to the clean speech sample, preserving more structural information about the initial sample. The SB-SENet model uses a Transformer to capture the distinctive features of time-series signals and a U-Net to fuse multi-scale information. The loss function is constructed within a score-based generative framework and incorporates phase loss, magnitude loss, and metric loss to gradually reduce the difference between the predicted sample and the clean speech sample. Experimental results show that, in terms of speech quality, SB-SENet achieves a PESQ score of 3.79 on the Voicebank+DEMAND dataset and a PESQ score of 3.65 on the DNS Challenge dataset, which is state-of-the-art performance compared with recent speech enhancement models. Demo samples are available at https://svcodec.github.io/SB-SENet.github.io/.
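The idea of connecting the noisy sample directly to the clean sample, rather than passing through a Gaussian prior, can be illustrated with the simplest textbook case: a Brownian bridge pinned at both endpoints. The sketch below is an assumption-laden illustration and not the SB-SENet formulation; the function name `bridge_sample`, the constant noise scale `sigma`, and the unit time interval are all hypothetical choices, since the abstract does not specify the paper's noise schedule or parameterization.

```python
import math
import random


def bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Draw x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Illustrative assumption: a Brownian reference process with constant
    scale `sigma`. The marginal at time t is Gaussian with
    mean (1 - t) * x0 + t * x1 and variance sigma^2 * t * (1 - t).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = rng or random.Random()
    std = sigma * math.sqrt(t * (1.0 - t))
    return [
        (1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, x1)
    ]


if __name__ == "__main__":
    # Toy stand-ins for a noisy and a clean signal (hypothetical values).
    noisy = [0.9, -0.4, 0.3, 0.1]
    clean = [1.0, -0.5, 0.2, 0.0]
    # An intermediate state halfway along the bridge.
    print(bridge_sample(noisy, clean, t=0.5, sigma=0.1, rng=random.Random(0)))
```

Because the variance vanishes at both ends, the path starts exactly at the noisy sample and ends exactly at the clean one, so intermediate states stay close to the structure of the input instead of dissolving into pure noise.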
About the journal:
Since its launch in 1968, Applied Acoustics has been publishing high quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience through Complete Papers, Short Technical Notes, and Review Articles, and thereby provides a wealth of technological information that can be used to solve related problems.
Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.