{"title":"Deep encoder and decoder for time-domain speech separation","authors":"Kohei TAKAHASHI, Toshihiko SHIRAISHI","doi":"10.1299/mej.23-00124","DOIUrl":null,"url":null,"abstract":"The previous research of speech separation has significantly improved separation performance based on the time-domain method: encoder, separator, and decoder. Most research has focused on revising the architecture of the separator. In contrast, a single 1-D convolution layer and 1-D transposed convolution layer have been used as encoder and decoder, respectively. This study proposes deep encoder and decoder architectures, consisting of stacked 1-D convolution layers, 1-D transposed convolution layers, or residual blocks, for the time-domain speech separation. The intentions of revising them are to improve separation performance and overcome the tradeoff between separation performance and computational cost due to their stride by enhancing their mapping ability. We applied them to Conv-TasNet, the typical model in the time-domain speech separation. Our results indicate that the better separation performance is archived as the number of their layers increases and that changing the number of their layers from 1 to 12 results in more than 1 dB improvement of SI-SDR on WSJ0-2mix. Additionally, it is suggested that the encoder and decoder should be deeper, corresponding to their stride since their task may be more difficult as the stride becomes larger. 
This study represents the importance of improving these architectures as well as separators.","PeriodicalId":45233,"journal":{"name":"Mechanical Engineering Journal","volume":"8 1","pages":"0"},"PeriodicalIF":0.4000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Mechanical Engineering Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1299/mej.23-00124","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENGINEERING, MECHANICAL","Score":null,"Total":0}
Citations: 0
Abstract
Previous research on speech separation has significantly improved separation performance with the time-domain approach, which comprises an encoder, a separator, and a decoder. Most research has focused on revising the architecture of the separator; in contrast, a single 1-D convolution layer and a single 1-D transposed convolution layer have typically been used as the encoder and decoder, respectively. This study proposes deep encoder and decoder architectures, consisting of stacked 1-D convolution layers, 1-D transposed convolution layers, or residual blocks, for time-domain speech separation. The intention of revising them is to improve separation performance and, by enhancing their mapping ability, to overcome the tradeoff between separation performance and computational cost caused by their stride. We applied them to Conv-TasNet, a typical model for time-domain speech separation. Our results indicate that better separation performance is achieved as the number of encoder and decoder layers increases, and that increasing the number of layers from 1 to 12 yields more than 1 dB of SI-SDR improvement on WSJ0-2mix. Additionally, the results suggest that the encoder and decoder should be made deeper as the stride grows, since their task may become more difficult with larger strides. This study demonstrates the importance of improving these architectures as well as the separator.
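To make the architectural idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: a stride-S 1-D convolution encoder is equivalent to slicing the waveform into overlapping frames and applying a per-frame linear map, and a "deep encoder" in the sense described here stacks further residual layers on top of that first mapping. All sizes (window, stride, feature dimension, depth) are hypothetical toy values.

```python
import numpy as np

def frame(x, win, stride):
    # Slice a waveform into overlapping frames of length `win` with hop `stride`.
    n = (len(x) - win) // stride + 1
    return np.stack([x[i * stride : i * stride + win] for i in range(n)])  # (n, win)

rng = np.random.default_rng(0)
win, stride, feat = 16, 8, 64            # hypothetical encoder sizes
x = rng.standard_normal(8000)            # toy 1 s waveform at 8 kHz

# Baseline encoder: one 1-D conv layer == per-frame linear map + ReLU.
W0 = rng.standard_normal((win, feat)) / np.sqrt(win)
frames = frame(x, win, stride)           # (n_frames, win)
h = np.maximum(frames @ W0, 0.0)         # (n_frames, feat)

# "Deep encoder" (sketch): stack residual pointwise layers on the first conv,
# increasing the mapping capacity without changing the frame rate set by `stride`.
depth = 4
for _ in range(depth):
    W = rng.standard_normal((feat, feat)) / np.sqrt(feat)
    h = h + np.maximum(h @ W, 0.0)       # one residual block

print(h.shape)                           # one feature vector per frame
```

A deep decoder would mirror this: residual layers refining the latent frames, followed by a transposed convolution (per-frame linear map back to `win` samples, combined by overlap-add at hop `stride`). The larger the stride, the fewer frames represent the same audio, which is why the abstract argues deeper mappings are needed at larger strides.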