Joint training of front-end and back-end deep neural networks for robust speech recognition

Tian Gao, Jun Du, Lirong Dai, Chin-Hui Lee
{"title":"面向鲁棒语音识别的前端和后端深度神经网络联合训练","authors":"Tian Gao, Jun Du, Lirong Dai, Chin-Hui Lee","doi":"10.1109/ICASSP.2015.7178797","DOIUrl":null,"url":null,"abstract":"Based on the recently proposed speech pre-processing front-end with deep neural networks (DNNs), we first investigate different feature mapping directly from noisy speech via DNN for robust speech recognition. Next, we propose to jointly train a single DNN for both feature mapping and acoustic modeling. In the end, we show that the word error rate (WER) of the jointly trained system could be significantly reduced by the fusion of multiple DNN pre-processing systems which implies that features obtained from different domains of the DNN-enhanced speech signals are strongly complementary. Testing on the Aurora4 noisy speech recognition task our best system with multi-condition training can achieves an average WER of 10.3%, yielding a relative reduction of 16.3% over our previous DNN pre-processing only system with a WER of 12.3%. To the best of our knowledge, this represents the best published result on the Aurora4 task without using any adaptation techniques.","PeriodicalId":117666,"journal":{"name":"2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"73","resultStr":"{\"title\":\"Joint training of front-end and back-end deep neural networks for robust speech recognition\",\"authors\":\"Tian Gao, Jun Du, Lirong Dai, Chin-Hui Lee\",\"doi\":\"10.1109/ICASSP.2015.7178797\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Based on the recently proposed speech pre-processing front-end with deep neural networks (DNNs), we first investigate different feature mapping directly from noisy speech via DNN for robust speech recognition. Next, we propose to jointly train a single DNN for both feature mapping and acoustic modeling. In the end, we show that the word error rate (WER) of the jointly trained system could be significantly reduced by the fusion of multiple DNN pre-processing systems which implies that features obtained from different domains of the DNN-enhanced speech signals are strongly complementary. Testing on the Aurora4 noisy speech recognition task our best system with multi-condition training can achieves an average WER of 10.3%, yielding a relative reduction of 16.3% over our previous DNN pre-processing only system with a WER of 12.3%. 
To the best of our knowledge, this represents the best published result on the Aurora4 task without using any adaptation techniques.\",\"PeriodicalId\":117666,\"journal\":{\"name\":\"2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-04-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"73\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICASSP.2015.7178797\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSP.2015.7178797","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 73

Abstract

Based on the recently proposed speech pre-processing front-end with deep neural networks (DNNs), we first investigate different feature mappings performed directly on noisy speech via a DNN for robust speech recognition. Next, we propose to jointly train a single DNN for both feature mapping and acoustic modeling. Finally, we show that the word error rate (WER) of the jointly trained system can be significantly reduced by fusing multiple DNN pre-processing systems, which implies that features obtained from different domains of the DNN-enhanced speech signals are strongly complementary. On the Aurora4 noisy speech recognition task, our best system with multi-condition training achieves an average WER of 10.3%, a relative reduction of 16.3% over our previous DNN pre-processing-only system with a WER of 12.3%. To the best of our knowledge, this represents the best published result on the Aurora4 task without using any adaptation techniques.
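
The core idea in the abstract is stacking a feature-mapping front-end DNN on top of an acoustic-model back-end DNN and optimizing them as a single network against the senone classification objective. Below is a minimal sketch of that idea in PyTorch; the layer sizes, feature dimensions (FEAT_DIM, CONTEXT, NUM_SENONES), the random placeholder data, and the single-step training loop are illustrative assumptions, not the configuration reported in the paper (which, for example, would start from separately trained front-end and back-end networks before joint fine-tuning).

# Minimal sketch of joint front-end / back-end DNN training (PyTorch).
# All dimensions and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

FEAT_DIM = 40        # per-frame feature dimension, e.g. log-Mel filterbanks (assumed)
CONTEXT = 11         # number of frames in the input context window (assumed)
NUM_SENONES = 2000   # number of tied-state (senone) targets (assumed)

class FrontEndDNN(nn.Module):
    """Feature-mapping front-end: noisy context window -> enhanced frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM * CONTEXT, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, FEAT_DIM),           # regression output (enhanced features)
        )
    def forward(self, x):
        return self.net(x)

class BackEndDNN(nn.Module):
    """Acoustic model back-end: enhanced frame -> senone logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 1024), nn.Sigmoid(),
            nn.Linear(1024, 1024), nn.Sigmoid(),
            nn.Linear(1024, NUM_SENONES),
        )
    def forward(self, x):
        return self.net(x)

front_end, back_end = FrontEndDNN(), BackEndDNN()
joint = nn.Sequential(front_end, back_end)           # one stacked network
optimizer = torch.optim.SGD(joint.parameters(), lr=0.1)
ce_loss = nn.CrossEntropyLoss()

# One joint training step on a random mini-batch (placeholder data).
noisy = torch.randn(256, FEAT_DIM * CONTEXT)          # noisy input windows
senone_labels = torch.randint(0, NUM_SENONES, (256,)) # labels from forced alignment
logits = joint(noisy)                                 # front-end then back-end
loss = ce_loss(logits, senone_labels)
optimizer.zero_grad()
loss.backward()                                       # gradients flow through both stages
optimizer.step()
print(f"joint cross-entropy loss: {loss.item():.3f}")

Because the cross-entropy gradient is back-propagated through both stages, the front-end is no longer trained purely as a speech enhancer but is pushed toward feature mappings that directly help the recognizer, which is the motivation for joint training over a pipeline of independently trained modules.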