Discriminative piecewise linear transformation based on deep learning for noise robust automatic speech recognition

2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707755

Yosuke Kashiwagi, D. Saito, N. Minematsu, K. Hirose


Citations: 9

Abstract

In this paper, we propose using deep neural networks to extend conventional statistical feature enhancement methods based on piecewise linear transformation. Stereo-based piecewise linear compensation for environments (SPLICE), a powerful statistical approach to feature enhancement, models the probability distribution of noisy input features as a mixture of Gaussians. However, the soft assignment of an input vector to the resulting regions is sometimes inaccurate, so the vector is passed through an inappropriate conversion. This easily degrades conversion performance, especially when the conversion has to be linear. Feature enhancement using neural networks is another powerful approach, which can directly model the non-linear relationship between the noisy and clean feature spaces. It tends, however, to suffer from over-fitting. We attempt to mitigate this problem by reducing the number of model parameters to be estimated. We train a neural network whose output layer is associated with states in the clean feature space rather than the noisy feature space, which makes the size of the output layer independent of the type of noisy environment. We first model the distribution of clean features as a Gaussian mixture model and then use deep neural networks to discriminatively estimate the state in the clean space to which a noisy input feature corresponds. Experimental evaluations on the Aurora 2 dataset show that the proposed method outperforms the conventional methods.
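The abstract gives no equations, so the following is only a minimal sketch of a generic SPLICE-style piecewise linear enhancement, in which the enhanced feature is a posterior-weighted sum of per-region affine transforms of the noisy feature. All function names, the ridge term, and the weighted least-squares fit from stereo (noisy, clean) pairs are assumptions for illustration, not the authors' implementation. In the proposed method the posteriors would come from a deep neural network whose outputs correspond to clean-space GMM states; here they are simply passed in as a matrix.

```python
import numpy as np

def softmax(logits):
    """Turn network outputs (T, K) into state posteriors that sum to 1 per frame."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_region_transforms(noisy, clean, post, ridge=1e-6):
    """Fit one affine map per region by posterior-weighted least squares.

    noisy, clean: (T, D) stereo feature pairs; post: (T, K) state posteriors.
    Returns W of shape (K, D + 1, D) such that clean ~ [noisy, 1] @ W[k].
    The small ridge term (an assumption) keeps the normal equations solvable.
    """
    T, D = noisy.shape
    K = post.shape[1]
    Y = np.hstack([noisy, np.ones((T, 1))])  # append a bias column
    W = np.zeros((K, D + 1, D))
    for k in range(K):
        w = post[:, k:k + 1]                 # (T, 1) frame weights for region k
        gram = Y.T @ (w * Y) + ridge * np.eye(D + 1)
        W[k] = np.linalg.solve(gram, Y.T @ (w * clean))
    return W

def enhance(noisy, post, W):
    """Posterior-weighted sum of the per-region affine transforms."""
    T = noisy.shape[0]
    Y = np.hstack([noisy, np.ones((T, 1))])
    per_region = np.einsum("td,kde->tke", Y, W)     # (T, K, D)
    return np.einsum("tk,tke->te", post, per_region)
```

Because the regions are indexed by clean-space states, the number of transforms `K` in this sketch depends only on the clean model, which mirrors the abstract's point that the output layer size does not depend on the noise environment.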

