Multi-resolution spectral input for convolutional neural network-based speech recognition

L. Tóth
Published in: 2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), 2017-07-06
DOI: 10.1109/SPED.2017.7990430 (https://doi.org/10.1109/SPED.2017.7990430)
Citations: 5

Abstract

The convolutional deep neural network component applied frequently in current speech recognizers is trained on a context of consecutive spectral feature vectors. Here, we investigate whether we can extend the time span of this input and reduce the number of spectral features at the same time by using a multi-resolution spectrum as input. In the proposed multi-resolution scheme, the network processes the nearby neighbors of the actual frame using the standard resolution, while it applies a gradually coarser resolution for more distant frames. Using this solution, we managed to extend the input of our network to a time context of 45 frames without increasing the number of input features, and we also achieved a relative error rate reduction of 3–4% compared to the conventional high-resolution representation. We report a phone error rate of 17.0% on the TIMIT core test set, which is competitive with the best scores published on this data set.
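
The multi-resolution idea described above can be illustrated with a minimal NumPy sketch. The abstract does not specify how the coarser resolution is computed or where the resolution changes, so everything concrete here is an assumption made for illustration: the function name `multires_context`, the band layout `SIDE_BANDS`, the use of plain averaging over blocks of frames, and the edge padding. It is not the author's implementation.

```python
import numpy as np

# Hypothetical layout (NOT taken from the paper): for each side of the
# current frame, (number of frames covered, pooling width), listed from
# the center outwards. 4 + 6 + 12 = 22 frames per side, 45 in total.
SIDE_BANDS = [(4, 1), (6, 2), (12, 4)]


def multires_context(features, t, side_bands=SIDE_BANDS):
    """Build a multi-resolution context window around frame t.

    features: array of shape (num_frames, num_bins).
    Nearby frames are kept as they are; more distant frames are averaged
    in progressively wider blocks (an assumed pooling choice).
    Frames outside the utterance are replaced by the nearest edge frame.
    """
    num_frames = features.shape[0]

    def pooled(start, width):
        # Average `width` consecutive frames, clipping indices at the edges.
        idx = np.clip(np.arange(start, start + width), 0, num_frames - 1)
        return features[idx].mean(axis=0)

    left, right = [], []
    offset = 1  # distance of the first frame of the current band from t
    for span, width in side_bands:
        for k in range(0, span, width):
            right.append(pooled(t + offset + k, width))
            left.append(pooled(t - offset - k - width + 1, width))
        offset += span

    # Left blocks were collected nearest-first; reverse them so the
    # output rows run in chronological order.
    return np.stack(left[::-1] + [features[t]] + right)


# Toy input in which every frame holds its own index, to make pooling visible.
frames = np.repeat(np.arange(100, dtype=float)[:, None], 2, axis=1)
context = multires_context(frames, 50)
print(context.shape)
```

With this assumed layout, a 45-frame span is represented by 21 feature vectors: 9 at the original resolution around the current frame, and 6 pooled vectors on each side. The actual band boundaries and pooling method used in the paper may differ.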