Speech Separation With Pretrained Frontend to Minimize Domain Mismatch

IF 5.1; Zone 2 (Computer Science); JCR Q1 (Acoustics)

IEEE/ACM Transactions on Audio, Speech, and Language Processing. Pub Date: 2024-08-20. DOI: 10.1109/TASLP.2024.3446242

Wupeng Wang; Zexu Pan; Xinke Li; Shuai Wang; Haizhou Li

Volume 32, pages 4184-4198. https://ieeexplore.ieee.org/document/10640238/

Citations: 0

Abstract

Speech separation seeks to separate individual speech signals from a speech mixture. Typically, most separation models are trained on synthetic data due to the unavailability of target reference in real-world cocktail party scenarios. As a result, there exists a domain gap between real and synthetic data when deploying speech separation models in real-world applications. In this paper, we propose a self-supervised domain-invariant pretrained (DIP) frontend that is exposed to mixture data without the need for target reference speech. The DIP frontend utilizes a Siamese network with two innovative pretext tasks, mixture predictive coding (MPC) and mixture invariant coding (MIC), to capture shared contextual cues between real and synthetic unlabeled mixtures. Subsequently, we freeze the DIP frontend as a feature extractor when training the downstream speech separation models on synthetic data. By pretraining the DIP frontend with the contextual cues, we expect that the speech separation skills learned from synthetic data can be effectively transferred to real data. To benefit from the DIP frontend, we introduce a novel separation pipeline to align the feature resolution of the separation models. We evaluate the speech separation quality on standard benchmarks and real-world datasets. The results confirm the superiority of our DIP frontend over existing speech separation models. This study underscores the potential of large-scale pretraining to enhance the quality and intelligibility of speech separation in real-world applications.
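The abstract notes that separation models are usually trained on synthetic mixtures because real recordings come without target references. The following is a minimal sketch of why that is so, not code from the paper: a synthetic mixture is built by summing clean sources, so each source remains available as a training target. The function names (`rms`, `mix_at_snr`), the toy sinusoid "speakers", and the 5 dB ratio are all illustrative assumptions.

```python
import math


def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer level ratio is `snr_db`, then sum.

    Returns (mixture, scaled_interferer). Both sources stay available as
    references, which is what a real-world recording lacks.
    """
    if len(target) != len(interferer):
        raise ValueError("sources must have the same length")
    gain = rms(target) / (rms(interferer) * 10 ** (snr_db / 20))
    scaled = [gain * x for x in interferer]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, scaled


# Toy stand-ins for two clean utterances (hypothetical data, 16 kHz, 0.1 s).
n = 1600
speaker_a = [math.sin(2 * math.pi * 220 * i / 16000) for i in range(n)]
speaker_b = [0.3 * math.sin(2 * math.pi * 330 * i / 16000) for i in range(n)]

mixture, scaled_b = mix_at_snr(speaker_a, speaker_b, snr_db=5.0)

# The level ratio actually achieved between the two reference signals, in dB.
achieved_snr = 20 * math.log10(rms(speaker_a) / rms(scaled_b))
```

For a real recording only `mixture` would exist; `speaker_a` and `scaled_b` would be unknown. That missing supervision is the gap the abstract says the DIP frontend addresses by pretraining on unlabeled mixtures.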

Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (categories: Acoustics; Engineering, Electrical & Electronic)

CiteScore

11.30

Self-citation rate

11.10%

Articles published

217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.