Multi-resolution spectral input for convolutional neural network-based speech recognition

2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD). Pub Date: 2017-07-06. DOI: 10.1109/SPED.2017.7990430

L. Tóth

Citations: 5

Abstract

The convolutional deep neural network component applied frequently in current speech recognizers is trained on a context of consecutive spectral feature vectors. Here, we investigate whether we can extend the time span of this input and reduce the number of spectral features at the same time by using a multi-resolution spectrum as input. In the proposed multi-resolution scheme, the network processes the nearby neighbors of the actual frame using the standard resolution, while it applies a gradually coarser resolution for more distant frames. Using this solution, we managed to extend the input of our network to a time context of 45 frames without increasing the number of input features, and we also achieved a relative error rate reduction of 3–4% compared to the conventional high-resolution representation. We report a phone error rate of 17.0% on the TIMIT core test set, which is competitive with the best scores published on this data set.
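The multi-resolution idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the block layout in `SIDE_LAYOUT`, pooling by simple averaging along the time axis only, and the edge-clamping rule are all assumptions chosen for illustration. The abstract does not specify them. The layout was picked only so that the covered span is 45 frames, matching the figure quoted above.

```python
from typing import List, Sequence, Tuple

Frame = List[float]

# Hypothetical layout, NOT taken from the paper: (block_width, blocks_per_side),
# listed from the frames nearest the centre to the farthest ones.
SIDE_LAYOUT: Tuple[Tuple[int, int], ...] = ((1, 4), (2, 3), (4, 3))


def context_span(layout: Sequence[Tuple[int, int]] = SIDE_LAYOUT) -> int:
    """Number of original frames covered: the centre frame plus both sides."""
    return 1 + 2 * sum(width * count for width, count in layout)


def average_frames(frames: Sequence[Frame]) -> Frame:
    """Element-wise mean of equally sized spectral vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def multi_resolution_context(
    spec: Sequence[Frame],
    t: int,
    layout: Sequence[Tuple[int, int]] = SIDE_LAYOUT,
) -> List[Frame]:
    """Return pooled vectors around frame t, ordered from past to future.

    Frames close to t are kept as they are; more distant frames are averaged
    in progressively wider blocks. Indices outside the utterance are clamped
    to the first or last frame (an assumed padding rule).
    """
    def frame_at(i: int) -> Frame:
        return spec[min(max(i, 0), len(spec) - 1)]

    def side(direction: int) -> List[Frame]:
        blocks = []
        offset = 1  # distance of the next unused frame from the centre
        for width, count in layout:
            for _ in range(count):
                idxs = [t + direction * (offset + k) for k in range(width)]
                blocks.append(average_frames([frame_at(i) for i in idxs]))
                offset += width
        return blocks  # nearest block first

    past = side(-1)[::-1]  # reorder so the farthest past block comes first
    future = side(+1)
    return past + [list(spec[t])] + future
```

With this assumed layout, each side contributes 4 single frames, 3 blocks of 2 frames and 3 blocks of 4 frames, so 45 original frames are summarized by 21 vectors. A conventional full-resolution input of 21 consecutive frames would have the same number of features but cover less than half the time span.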



