{"title":"用于噪声混响环境下语音分离的深度编解码器双路径神经网络","authors":"Chunxi Wang, Maoshen Jia, Xinfeng Zhang","doi":"10.1186/s13636-023-00307-5","DOIUrl":null,"url":null,"abstract":"Abstract In recent years, the speaker-independent, single-channel speech separation problem has made significant progress with the development of deep neural networks (DNNs). However, separating the speech of each interested speaker from an environment that includes the speech of other speakers, background noise, and room reverberation remains challenging. In order to solve this problem, a speech separation method for a noisy reverberation environment is proposed. Firstly, the time-domain end-to-end network structure of a deep encoder/decoder dual-path neural network is introduced in this paper for speech separation. Secondly, to make the model not fall into local optimum during training, a loss function stretched optimal scale-invariant signal-to-noise ratio (SOSISNR) was proposed, inspired by the scale-invariant signal-to-noise ratio (SISNR). At the same time, in order to make the training more appropriate to the human auditory system, the joint loss function is extended based on short-time objective intelligibility (STOI). Thirdly, an alignment operation is proposed to reduce the influence of time delay caused by reverberation on separation performance. Combining the above methods, the subjective and objective evaluation metrics show that this study has better separation performance in complex sound field environments compared to the baseline methods.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"47 1","pages":"0"},"PeriodicalIF":2.4000,"publicationDate":"2023-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments\",\"authors\":\"Chunxi Wang, Maoshen Jia, Xinfeng Zhang\",\"doi\":\"10.1186/s13636-023-00307-5\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract In recent years, the speaker-independent, single-channel speech separation problem has made significant progress with the development of deep neural networks (DNNs). However, separating the speech of each interested speaker from an environment that includes the speech of other speakers, background noise, and room reverberation remains challenging. In order to solve this problem, a speech separation method for a noisy reverberation environment is proposed. Firstly, the time-domain end-to-end network structure of a deep encoder/decoder dual-path neural network is introduced in this paper for speech separation. Secondly, to make the model not fall into local optimum during training, a loss function stretched optimal scale-invariant signal-to-noise ratio (SOSISNR) was proposed, inspired by the scale-invariant signal-to-noise ratio (SISNR). At the same time, in order to make the training more appropriate to the human auditory system, the joint loss function is extended based on short-time objective intelligibility (STOI). Thirdly, an alignment operation is proposed to reduce the influence of time delay caused by reverberation on separation performance. 
Combining the above methods, the subjective and objective evaluation metrics show that this study has better separation performance in complex sound field environments compared to the baseline methods.\",\"PeriodicalId\":49309,\"journal\":{\"name\":\"Journal on Audio Speech and Music Processing\",\"volume\":\"47 1\",\"pages\":\"0\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2023-10-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal on Audio Speech and Music Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1186/s13636-023-00307-5\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal on Audio Speech and Music Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s13636-023-00307-5","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments
Abstract In recent years, significant progress has been made on the speaker-independent, single-channel speech separation problem with the development of deep neural networks (DNNs). However, separating the speech of each speaker of interest from a mixture that also contains the speech of other speakers, background noise, and room reverberation remains challenging. To address this problem, a speech separation method for noisy, reverberant environments is proposed. First, an end-to-end time-domain deep encoder/decoder dual-path neural network is introduced for speech separation. Second, to keep the model from falling into local optima during training, a stretched optimal scale-invariant signal-to-noise ratio (SOSISNR) loss, inspired by the scale-invariant signal-to-noise ratio (SISNR), is proposed. In addition, to make training better match the human auditory system, the loss is extended to a joint loss function based on short-time objective intelligibility (STOI). Third, an alignment operation is proposed to reduce the influence of reverberation-induced time delay on separation performance. Combining these methods, subjective and objective evaluation metrics show that the proposed approach achieves better separation performance than the baseline methods in complex sound-field environments.
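For reference, the quantity that inspires the SOSISNR loss is the standard SI-SNR widely used to train time-domain separation models. Below is a minimal NumPy sketch of that standard SI-SNR and its negated use as a training loss; the function names (si_snr, si_snr_loss) are illustrative, and the specific "stretching" of SOSISNR, the STOI term of the joint loss, and the alignment operation are not specified in the abstract and are therefore not shown.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Standard scale-invariant SNR (SI-SNR) in dB between two
    equal-length 1-D time-domain signals (not the paper's SOSISNR)."""
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    # Project the estimate onto the target to obtain the scaled reference.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps) + eps)

def si_snr_loss(estimate, target):
    # Training typically minimizes the negative SI-SNR.
    return -si_snr(estimate, target)
```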
Journal introduction:
The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists, and engineers working on the theory and applications of processing various audio signals, with a specific focus on speech and music. The journal is an interdisciplinary venue for the dissemination of all basic and applied aspects of speech communication and audio processing.