{"title":"A comprehensive study on supervised single-channel noisy speech separation with multi-task learning","authors":"Shaoxiang Dang , Tetsuya Matsumoto , Yoshinori Takeuchi , Hiroaki Kudo","doi":"10.1016/j.specom.2024.103162","DOIUrl":null,"url":null,"abstract":"<div><div>This research presents a comprehensive investigation and comparison of noisy speech separation methods using multi-task learning. First, we categorize all methods into two pipelines: enhancement priority pipeline (EPP) and separation priority pipeline (SPP), based on whether prioritizing enhancement or separation. Next, we classify each pipeline into shared encoder–decoder scheme (SEDS) and independent encoder–decoder scheme (IEDS), depending on whether the two modules share the same encoder and decoder. Additionally, we introduce two types of intermediate structures between the two modules. One structure uses time–frequency (T–F) representations, while the other uses T–F masks. This article elaborates on the strengths and weaknesses of each approach, particularly in mitigating over-suppression and improving computational efficiency. Our experiments show substantial improvements in SPP with IEDS across multiple metrics on the LibriXmix dataset. In addition, by replacing the synthesis-based trick in the enhancement module, the model achieves superior generalization on the LibriCSS dataset.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103162"},"PeriodicalIF":2.4000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S016763932400133X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0
Abstract
This research presents a comprehensive investigation and comparison of noisy speech separation methods using multi-task learning. First, we categorize all methods into two pipelines: enhancement priority pipeline (EPP) and separation priority pipeline (SPP), based on whether prioritizing enhancement or separation. Next, we classify each pipeline into shared encoder–decoder scheme (SEDS) and independent encoder–decoder scheme (IEDS), depending on whether the two modules share the same encoder and decoder. Additionally, we introduce two types of intermediate structures between the two modules. One structure uses time–frequency (T–F) representations, while the other uses T–F masks. This article elaborates on the strengths and weaknesses of each approach, particularly in mitigating over-suppression and improving computational efficiency. Our experiments show substantial improvements in SPP with IEDS across multiple metrics on the LibriXmix dataset. In addition, by replacing the synthesis-based trick in the enhancement module, the model achieves superior generalization on the LibriCSS dataset.
期刊介绍:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal''s primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.