USDnet: Unsupervised Speech Dereverberation via Neural Forward Filtering

IF 5.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-08-16 DOI:10.1109/TASLP.2024.3445120

Zhong-Qiu Wang

{"title":"USDnet: Unsupervised Speech Dereverberation via Neural Forward Filtering","authors":"Zhong-Qiu Wang","doi":"10.1109/TASLP.2024.3445120","DOIUrl":null,"url":null,"abstract":"In reverberant conditions with a single speaker, each far-field microphone records a reverberant version of the same speaker signal at a different location. In over-determined conditions, where there are multiple microphones but only one speaker, each recorded mixture signal can be leveraged as a constraint to narrow down the solutions to target anechoic speech and thereby reduce reverberation. Equipped with this insight, we propose USDnet, a novel deep neural network (DNN) approach for unsupervised speech dereverberation (USD). At each training step, we first feed an input mixture to USDnet to produce an estimate for target speech, and then linearly filter the DNN estimate to approximate the multi-microphone mixture so that the constraint can be satisfied at each microphone, thereby regularizing the DNN estimate to approximate target anechoic speech. The linear filter can be estimated based on the mixture and DNN estimate via neural forward filtering algorithms such as forward convolutive prediction. We show that this novel methodology can promote unsupervised dereverberation of single-source reverberant speech.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3882-3895"},"PeriodicalIF":5.1000,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10638210/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

In reverberant conditions with a single speaker, each far-field microphone records a reverberant version of the same speaker signal at a different location. In over-determined conditions, where there are multiple microphones but only one speaker, each recorded mixture signal can be leveraged as a constraint to narrow down the solutions to target anechoic speech and thereby reduce reverberation. Equipped with this insight, we propose USDnet, a novel deep neural network (DNN) approach for unsupervised speech dereverberation (USD). At each training step, we first feed an input mixture to USDnet to produce an estimate for target speech, and then linearly filter the DNN estimate to approximate the multi-microphone mixture so that the constraint can be satisfied at each microphone, thereby regularizing the DNN estimate to approximate target anechoic speech. The linear filter can be estimated based on the mixture and DNN estimate via neural forward filtering algorithms such as forward convolutive prediction. We show that this novel methodology can promote unsupervised dereverberation of single-source reverberant speech.

查看原文本刊更多论文

USDnet：通过神经前向滤波实现无监督语音消除混响

在单个扬声器的混响条件下，每个远场麦克风都会在不同位置记录同一扬声器信号的混响版本。在过度确定的条件下，即有多个麦克风但只有一个扬声器时，每个记录的混合信号都可以作为约束条件加以利用，以缩小针对消声语音的解决方案的范围，从而减少混响。有鉴于此，我们提出了一种用于无监督语音消除混响（USD）的新型深度神经网络（DNN）方法--USDnet。在每个训练步骤中，我们首先向 USDnet 输入一个输入混合物，以生成目标语音的估计值，然后对 DNN 估计值进行线性过滤，以近似多麦克风混合物，从而使每个麦克风都能满足约束条件，从而使 DNN 估计值正则化，以近似目标消混响语音。线性滤波器可通过神经前向滤波算法（如前向卷积预测）根据混合物和 DNN 估计值进行估计。我们的研究表明，这种新方法可以促进单源混响语音的无监督消除混响。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.