Improvements to filterbank and delta learning within a deep neural network framework

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2014-05-04. DOI: 10.1109/ICASSP.2014.6854925

Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, G. Saon, B. Ramabhadran

Pages: 6839-6843

Citations: 12

Abstract


Many features used in speech recognition are hand-crafted and are not always tied to the objective at hand, namely minimizing word error rate (WER). We recently showed that replacing a perceptually motivated mel filter bank with a filter bank layer learned jointly with the rest of a deep neural network is promising. In this paper, we extend filter learning to a speaker-adapted, state-of-the-art system. First, we incorporate delta learning into the filter learning framework. Second, we incorporate several speaker adaptation techniques, including VTLN warping and speaker identity features. On a 50-hour English Broadcast News task, filter and delta learning yields a 5% relative improvement in WER over a fixed set of filters and deltas. Furthermore, after speaker adaptation, filter and delta learning gives a 3% relative improvement in WER over a state-of-the-art CNN.
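
The abstract does not spell out the layer equations, so the following is only a minimal sketch of the two ideas it names: a filter bank whose weights could be trained instead of fixed to a mel shape, and deltas computed as a weighted sum over neighbouring frames. The function names, the `exp` parameterization used to keep filter weights positive, the edge padding, and the window size are all assumptions made for illustration, not the authors' implementation.

```python
import math


def filterbank_layer(power_spectrum, weights):
    """Log filter-bank features from one frame of a power spectrum.

    power_spectrum: non-negative floats, one per frequency bin.
    weights: one list of unconstrained floats per filter (same length as
             power_spectrum). In a learned filter bank these would be
             trained jointly with the network; here they are plain inputs.
    exp(w) keeps every effective filter weight positive (assumption).
    """
    feats = []
    for filt in weights:
        energy = sum(math.exp(w) * p for w, p in zip(filt, power_spectrum))
        feats.append(math.log(energy + 1e-8))  # small floor avoids log(0)
    return feats


def regression_coeffs(K=2):
    """Fixed delta coefficients: n / (2 * sum_{m=1..K} m^2) for n in -K..K."""
    denom = 2 * sum(n * n for n in range(1, K + 1))
    return [n / denom for n in range(-K, K + 1)]


def delta(frames, coeffs):
    """Delta features as a weighted sum of neighbouring frames.

    frames: list of feature vectors, one per time step.
    coeffs: 2K+1 weights over time offsets -K..K. Passing regression_coeffs()
            gives conventional fixed deltas; a delta-learning setup would
            treat these weights as trainable (assumption of this sketch).
    Edges are handled by repeating the first and last frame.
    """
    K = len(coeffs) // 2
    T = len(frames)
    out = []
    for t in range(T):
        acc = [0.0] * len(frames[0])
        for k, c in enumerate(coeffs):
            src = frames[min(max(t + k - K, 0), T - 1)]
            acc = [a + c * s for a, s in zip(acc, src)]
        out.append(acc)
    return out
```

With all filter weights at zero, each filter simply sums the power spectrum, and with the fixed regression coefficients a feature that rises by one unit per frame has a delta of one away from the edges. Learning would move both sets of weights away from these starting points.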

