Multi-channel Itakura Saito Distance Minimization with Deep Neural Network

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Pub Date : 2019-05-12 DOI:10.1109/ICASSP.2019.8683410

M. Togami


Citations: 12

Abstract


This paper presents a multi-channel speech source separation method in which a deep neural network jointly optimizes the time-varying variance of each speech source and its multi-channel spatial covariance matrix, without any iterative optimization. Rather than a loss function that ignores the spatial characteristics of the output signal, the proposed method uses a loss function based on minimizing the multi-channel Itakura-Saito distance (MISD), which does evaluate those characteristics. The MISD-based cost function is computed from the estimated posterior probability density function (PDF) of each speech source under a time-varying Gaussian distribution model. The loss function of the neural network is consistent with the PDF assumed for each speech source in multi-channel speech source separation. The network architecture uses multiple bidirectional long short-term memory (BLSTM) layers, and these layers are optimized jointly with the subsequent complex-valued signal processing during training. Experimental results show that network parameters optimized by the proposed MISD minimization yield more accurately separated speech signals than parameters optimized with loss functions that do not evaluate the spatial covariance matrix.
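To make the central quantity concrete, the following is a minimal sketch of the multichannel Itakura-Saito divergence between two Hermitian positive-definite covariance matrices, in the form commonly used in the multichannel separation literature: D(R, R̂) = tr(R R̂⁻¹) − log det(R R̂⁻¹) − M, where M is the number of channels. The abstract does not give the paper's exact loss (which is built on the estimated posterior PDF of each source), so this is an illustration of the divergence itself and not the author's implementation. The function names, the NumPy dependency, and the toy matrices are all assumptions introduced here.

```python
import numpy as np


def multichannel_is_divergence(ref_cov, est_cov):
    """Multichannel Itakura-Saito divergence D(ref_cov, est_cov).

    Both inputs are Hermitian positive-definite matrices of shape (..., M, M).
    Hypothetical helper for illustration; not taken from the paper.
    """
    num_channels = ref_cov.shape[-1]
    # est_cov^{-1} @ ref_cov has the same trace and determinant as
    # ref_cov @ est_cov^{-1}, and solve() avoids forming an explicit inverse.
    ratio = np.linalg.solve(est_cov, ref_cov)
    trace_term = np.trace(ratio, axis1=-2, axis2=-1).real
    # slogdet returns (sign, log|det|); the determinant is real and positive here.
    _, log_det = np.linalg.slogdet(ratio)
    return trace_term - log_det - num_channels


def random_hpd(rng, num_channels):
    """Draw a random Hermitian positive-definite matrix (toy data only)."""
    shape = (num_channels, num_channels)
    a = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    return a @ a.conj().T + num_channels * np.eye(num_channels)


rng = np.random.default_rng(0)
spatial_cov = random_hpd(rng, 2)          # assumed 2-microphone spatial covariance
ref_cov = 1.5 * spatial_cov               # time-varying variance times spatial covariance
est_cov = 0.7 * random_hpd(rng, 2)        # a mismatched estimate

d_same = multichannel_is_divergence(ref_cov, ref_cov)
d_diff = multichannel_is_divergence(ref_cov, est_cov)
print(d_same, d_diff)
```

The divergence is zero when the two covariance matrices coincide and positive otherwise, and with a single channel it reduces to the scalar Itakura-Saito divergence x/y − log(x/y) − 1. Because it compares full covariance matrices, a mismatch in the spatial structure is penalized even when the per-channel power is correct, which is the property the abstract contrasts with losses that do not evaluate spatial characteristics.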
