Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) | Pub Date: 2021-06-01 | DOI: 10.1109/WASPAA52581.2021.9632714

Scott Wisdom, A. Jansen, Ron J. Weiss, Hakan Erdogan, J. Hershey


Citations: 17

Abstract

Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models which tend to over-separate, producing more output sources than are present in the input. Second, the exponential computational complexity of the MixIT loss limits the number of feasible output sources. In this paper we address both issues. To combat over-separation we introduce new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs. We also experiment with a semantic classification loss by predicting weak class labels for each mixture. To handle larger numbers of sources, we introduce an efficient approximation using a fast least-squares solution, projected onto the MixIT constraint set. Our experiments show that the proposed losses curtail over-separation and improve overall performance. The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation. On the FUSS test set, we achieve over 13 dB in multi-source SI-SNR improvement, while boosting single-source reconstruction SI-SNR by over 17 dB.
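The exponential cost mentioned in the abstract can be made concrete with a small sketch. The code below is a minimal, illustrative brute-force version of a MixIT-style loss, not the authors' implementation: it assumes two reference mixtures, a plain negative SNR as the per-mixture loss, and pure-Python lists as signals. The function names `neg_snr_db` and `mixit_loss_bruteforce` are hypothetical. Each of the M output sources is assigned to exactly one of the two mixtures, so the search visits 2**M candidate assignments.

```python
import itertools
import math


def neg_snr_db(reference, estimate, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower is better).

    Assumption: a plain negative SNR stands in for the paper's loss.
    """
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return -10.0 * math.log10((signal_power + eps) / (noise_power + eps))


def mixit_loss_bruteforce(mix1, mix2, sources):
    """Try all 2**M assignments of M sources to two reference mixtures.

    Returns the lowest total loss and the assignment that achieves it,
    where assignment[m] is 0 (source m goes to mix1) or 1 (goes to mix2).
    """
    num_sources = len(sources)
    num_samples = len(mix1)
    best_loss, best_assignment = float("inf"), None
    for assignment in itertools.product((0, 1), repeat=num_sources):
        # Remix the separated sources according to this candidate assignment.
        remix = [[0.0] * num_samples, [0.0] * num_samples]
        for source, target in zip(sources, assignment):
            for t, value in enumerate(source):
                remix[target][t] += value
        loss = neg_snr_db(mix1, remix[0]) + neg_snr_db(mix2, remix[1])
        if loss < best_loss:
            best_loss, best_assignment = loss, assignment
    return best_loss, best_assignment


# Toy example: four outputs, one of them silent (an inactive source).
mix1 = [1.0, 0.0, 2.0, 0.0]
mix2 = [0.0, 3.0, 0.0, 1.0]
sources = [
    [1.0, 0.0, 0.0, 0.0],  # belongs to mix1
    [0.0, 3.0, 0.0, 1.0],  # belongs to mix2
    [0.0, 0.0, 2.0, 0.0],  # belongs to mix1
    [0.0, 0.0, 0.0, 0.0],  # silent output, fits either mixture
]
loss, assignment = mixit_loss_bruteforce(mix1, mix2, sources)
```

With four outputs the loop runs 16 times; doubling the number of outputs squares that count, which is the scaling the abstract's least-squares approximation is meant to avoid. The silent fourth output also shows why the loss alone does not penalize over-separation: assigning it to either mixture gives the same loss.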
