Compromised feature normalization method for deep neural network based speech recognition*

Phonetics and Speech Sciences. Pub Date: 2020-09-01. DOI: 10.13064/ksss.2020.12.3.065

M. Kim, H. S. Kim



Abstract



Feature normalization reduces the effect of environmental mismatch between training and test conditions by normalizing the statistical characteristics of acoustic feature parameters. It yields substantial performance improvements in traditional Gaussian mixture model-hidden Markov model (GMM-HMM)-based speech recognition systems. However, in deep neural network (DNN)-based speech recognition systems, minimizing the effect of environmental mismatch does not necessarily lead to the best performance. In this paper, we attribute this phenomenon to information loss caused by excessive feature normalization. We investigate whether there is a feature normalization method that maximizes speech recognition performance by reducing the impact of environmental mismatch to an appropriate degree while preserving information useful for training acoustic models. To this end, we introduce mean and exponentiated variance normalization (MEVN), a compromise between mean normalization (MN) and mean and variance normalization (MVN), and compare the performance of a DNN-based speech recognition system in noisy and reverberant environments according to the degree of variance normalization. Experimental results show that MEVN yields a slight performance improvement over MN and MVN, depending on the degree of variance normalization.
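The abstract does not state the MEVN formula, so the following is only a minimal sketch of one plausible reading: each feature dimension is mean-subtracted and then divided by its standard deviation raised to an exponent. Under that assumption, an exponent of 0 reduces to MN and an exponent of 1 reduces to MVN, with values in between giving partial variance normalization. The function name `mevn`, the parameters `alpha` and `eps`, and the choice of per-utterance statistics are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def mevn(frames: Sequence[Sequence[float]], alpha: float,
         eps: float = 1e-8) -> List[List[float]]:
    """Normalize each feature dimension over one utterance.

    Assumed form (not given in the abstract):
        x_hat = (x - mean) / std ** alpha
    alpha = 0.0 behaves like mean normalization (MN);
    alpha = 1.0 behaves like mean and variance normalization (MVN).
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_dims = len(frames[0])
    # Per-dimension mean over all frames of the utterance.
    means = [sum(f[d] for f in frames) / n_frames for d in range(n_dims)]
    # Per-dimension population standard deviation, floored by eps
    # so that constant dimensions do not cause a division by zero.
    stds = [
        max(math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n_frames), eps)
        for d in range(n_dims)
    ]
    # The exponent controls the degree of variance normalization.
    scales = [s ** alpha for s in stds]
    return [[(f[d] - means[d]) / scales[d] for d in range(n_dims)]
            for f in frames]


# Toy utterance: 3 frames, 2 feature dimensions with different scales.
utterance = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
mn = mevn(utterance, alpha=0.0)    # mean removed, scale untouched
mvn = mevn(utterance, alpha=1.0)   # mean removed, unit variance
half = mevn(utterance, alpha=0.5)  # partial variance normalization
```

With `alpha=0.5`, the two dimensions still differ in scale after normalization, but less than in the raw features. This is the kind of intermediate setting the abstract describes as a compromise between MN and MVN.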
