Investigating the Design Space of Diffusion Models for Speech Enhancement

IF 5.1 · Zone 2, Computer Science · Q1 Acoustics

IEEE/ACM Transactions on Audio, Speech, and Language Processing · Pub Date: 2024-10-03 · DOI: 10.1109/TASLP.2024.3473319

Philippe Gonzalez;Zheng-Hua Tan;Jan Østergaard;Jesper Jensen;Tommy Sonne Alstrøm;Tobias May

Volume 32, pages 4486-4500. Journal article, not open access. Full text: https://ieeexplore.ieee.org/document/10704960/

Citation count: 0

Abstract

Diffusion models are a new class of generative models that have shown outstanding performance in the image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach to adapting diffusion models to speech enhancement consists of modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously established in the image generation literature did not account for such a transformation towards the system input, which prevents existing diffusion-based speech enhancement systems from being related to that framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from the image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), and the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler makes it possible to outperform a popular diffusion-based speech enhancement system while using fewer sampling steps, thus reducing the computational cost by a factor of four.
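As a rough illustration of what a "progressive transformation between the clean and noisy speech signals" can look like, the sketch below perturbs a clean signal so that its mean drifts from the clean signal towards the noisy one as diffusion time increases. This is not the formulation from the paper: the exponential interpolation schedule, the `gamma` parameter, the function names, and the toy signals are all assumptions made for illustration only.

```python
import numpy as np


def interpolated_mean(x0, y, t, gamma=1.5):
    """Mean of the perturbed state at time t (hypothetical schedule).

    Equals the clean signal x0 at t = 0 and moves towards the noisy
    signal y as t grows. `gamma` is an assumed stiffness parameter.
    """
    w = np.exp(-gamma * t)
    return w * x0 + (1.0 - w) * y


def perturb(x0, y, t, sigma_t, rng):
    """Sample a perturbed state: interpolated mean plus Gaussian noise.

    `sigma_t` is the noise standard deviation at time t, supplied by
    the caller because its schedule is a separate design choice.
    """
    noise = rng.standard_normal(x0.shape)
    return interpolated_mean(x0, y, t) + sigma_t * noise


# Toy signals standing in for clean and noisy speech (assumed data).
rng = np.random.default_rng(0)
time_axis = np.linspace(0.0, 1.0, 256, endpoint=False)
x0 = np.sin(2 * np.pi * 5 * time_axis)
y = x0 + 0.3 * rng.standard_normal(x0.shape)

# A perturbed state halfway through the assumed diffusion interval.
x_t = perturb(x0, y, t=0.5, sigma_t=0.1, rng=rng)
```

In a framework without this transformation, the mean would stay at `x0` for all `t` and only the noise level would change; the abstract's finding is that the performance of earlier systems cannot be attributed to the mean drift itself.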


Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (Acoustics; Engineering, Electrical & Electronic)

CiteScore: 11.30

Self-citation rate: 11.10%

Articles published: 217

About the journal: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.