Improved Normalizing Flow-Based Speech Enhancement Using an all-Pole Gammatone Filterbank for Conditional Input Representation

2022 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2022-10-21. DOI: 10.1109/SLT54892.2023.10022898

Martin Strauss, Matteo Torcoli, B. Edler

Citations: 1

Abstract

Deep generative models for Speech Enhancement (SE) have received increasing attention in recent years. The most prominent example is the Generative Adversarial Network (GAN), while normalizing flows (NFs) have received less attention despite their potential. Building on previous work, this paper proposes architectural modifications and investigates different conditional input representations. Although Mel-spectrograms are a common choice in related work, they prove inadequate for the given scenario. As an alternative, a novel All-Pole Gammatone filterbank (APG) with high temporal resolution is proposed. Although computational evaluation metrics suggest that state-of-the-art GAN-based methods perform best, a perceptual evaluation via a listening test indicates that the presented NF approach (based on time domain and APG) performs best, especially at lower SNRs. On average, APG outputs are rated as having good quality, a rating unmatched by the other methods, including the GAN.
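The abstract does not describe how the APG is constructed. Purely as an illustration of the general idea, the sketch below implements a textbook all-pole gammatone approximation: each channel is a cascade of identical two-pole resonators, run at the full input sample rate so that no time resolution is lost to framing. The function names, the filter order of 4, the ERB bandwidth formula, and the 1.019 bandwidth factor are assumptions chosen for this example and are not taken from the paper.

```python
import numpy as np

def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore formula, assumed here)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def two_pole_stage(x, fc_hz, bw_hz, fs_hz):
    """One all-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = np.exp(-2.0 * np.pi * bw_hz / fs_hz)   # pole radius (< 1, so stable)
    theta = 2.0 * np.pi * fc_hz / fs_hz        # pole angle at the centre frequency
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    y = np.zeros(len(x))
    y1 = y2 = 0.0
    for n, xn in enumerate(x):
        yn = xn + a1 * y1 + a2 * y2
        y[n] = yn
        y2, y1 = y1, yn
    return y

def apg_filterbank(x, centre_freqs_hz, fs_hz, order=4):
    """Hypothetical helper: returns an array of shape (channels, len(x)).

    Each channel cascades `order` identical two-pole stages (no zeros),
    and the output keeps one value per input sample.
    """
    out = np.empty((len(centre_freqs_hz), len(x)))
    for k, fc in enumerate(centre_freqs_hz):
        y = np.asarray(x, dtype=float)
        bw = 1.019 * erb_bandwidth(fc)         # assumed gammatone bandwidth scaling
        for _ in range(order):
            y = two_pole_stage(y, fc, bw, fs_hz)
        out[k] = y
    return out

# Usage: a 500 Hz tone excites the 500 Hz channel far more than the 2 kHz one.
fs = 16000
t = np.arange(1600) / fs
tone = np.sin(2.0 * np.pi * 500.0 * t)
channels = apg_filterbank(tone, [500.0, 2000.0], fs)
```

The channel outputs here are unnormalized, so in practice a per-channel gain would likely be applied before using them as conditioning input; how the paper handles this is not stated in the abstract.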
