Empirical Sampling from Latent Utterance-wise Evidence Model for Missing Data ASR based on Neural Encoder-Decoder Model

Interspeech, Pub Date: 2022-09-18, DOI: 10.21437/interspeech.2022-576

Ryu Takeda, Yui Sudo, K. Nakadai, Kazunori Komatani

Pages: 3789-3793

Citations: 1

Abstract

Missing data automatic speech recognition (MD-ASR) can exploit the uncertainty of speech enhancement (SE) results without re-training model parameters. This uncertainty is represented by a probabilistic evidence model, so both the design of that model and the calculation of its expectation are important. Two problems arise when applying the MD approach to utterance-wise ASR based on a neural encoder-decoder model: the high dimensionality of an utterance-wise evidence model, and the discontinuity among frames of the samples generated when approximating the expectation with the Monte Carlo method. We propose new utterance-wise evidence models that use a latent variable, together with an empirical method for sampling from them. The space of our latent model is restricted by simpler conditional probability density functions (pdfs) given the latent variable, which enables us to generate samples from the low-dimensional space in a deterministic or stochastic way. Because the latent variable also works as a common smoothing parameter among the simple pdfs, the generated samples are continuous among frames, which, unlike frame-wise models, improves ASR performance. The uncertainty from a neural SE is also used as a component in our mixture pdf models. Experiments showed that our MD-ASR using a transformer model further improved the character error rate of the enhanced speech by 2.5 points on average.
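The contrast the abstract draws between frame-wise sampling and sampling through a shared latent variable can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration, not the paper's formulation: the Gaussian form of the evidence, the scalar latent `z`, the helper names (`sample_framewise`, `sample_latent`, `frame_roughness`, `mc_expectation`), and the stand-in `score_fn` that takes the place of the encoder-decoder model.

```python
import numpy as np

def sample_framewise(mu, sigma, n_samples, rng):
    """Draw independent noise for every frame and dimension (assumed Gaussian)."""
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + sigma * eps

def sample_latent(mu, sigma, z_values):
    """Draw one scalar latent z per sample and share it across all frames.

    Hypothetical form: x_t = mu_t + z * sigma_t for every frame t.
    """
    z = np.asarray(z_values, dtype=float).reshape(-1, 1, 1)
    return mu + sigma * z

def frame_roughness(samples, mu):
    """Mean squared frame-to-frame change of the deviation from mu."""
    dev = samples - mu
    return float(np.mean(np.diff(dev, axis=1) ** 2))

def mc_expectation(score_fn, samples):
    """Monte Carlo estimate of E[score_fn(x)] as a plain sample average."""
    return np.mean([score_fn(x) for x in samples], axis=0)

rng = np.random.default_rng(0)
T, D = 50, 8                                   # frames, feature dimensions
mu = np.cumsum(0.1 * rng.standard_normal((T, D)), axis=0)  # toy enhanced features
sigma = np.full((T, D), 0.5)                   # toy per-frame uncertainty

x_frame = sample_framewise(mu, sigma, 16, rng)
x_det = sample_latent(mu, sigma, np.linspace(-2.0, 2.0, 16))  # deterministic grid
x_sto = sample_latent(mu, sigma, rng.standard_normal(16))     # stochastic draws

rough_frame = frame_roughness(x_frame, mu)     # large: samples jump between frames
rough_latent = frame_roughness(x_det, mu)      # near zero: samples stay smooth

# Stand-in for the recognizer's output; a real system would call the ASR model.
score_fn = lambda x: x.mean(axis=0)
expected_score = mc_expectation(score_fn, x_det)
```

In this sketch each utterance-wise sample is indexed by a single number, so the sampling space is one-dimensional instead of `T * D`-dimensional, and the sample keeps the same offset in every frame. That is only meant to convey the intuition behind the two problems named above; the paper's evidence models are mixtures and are not reproduced here.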
