Multiresolution CNN for reverberant speech recognition

2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). Pub Date: 2017-11-01. DOI: 10.1109/ICSDA.2017.8384470

Sunchan Park, Yongwon Jeong, H. S. Kim


Citations: 18

Abstract

Deep neural network (DNN) acoustic models have greatly improved the performance of automatic speech recognition (ASR). However, DNN-based systems still perform poorly in reverberant environments. Convolutional neural network (CNN) acoustic models have shown a lower word error rate (WER) than fully connected DNN acoustic models in distant speech recognition. To improve reverberant speech recognition with CNN acoustic models, we propose a multiresolution CNN with two separate streams: one takes a wideband feature with a wide context window, and the other takes a narrowband feature with a narrow context window. Experiments on the ASR task of the REVERB Challenge 2014 showed that the proposed multiresolution CNN-based approach reduced the WER by 8.79% and 8.83% on the simulated and real-condition test data, respectively, compared with a conventional CNN-based method.
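As an illustration of the two-stream input described in the abstract, the following minimal sketch builds one wide-context patch and one narrow-context patch around the same center frame. It is not the authors' implementation. The function names (`splice`, `two_stream_patches`), the context sizes (7 and 3 frames on each side), the feature dimensions, the edge-padding rule, and the assumption that both feature sequences are frame-aligned are all hypothetical choices made for the example; the abstract does not specify them.

```python
from typing import List, Tuple

Frame = List[float]   # one feature vector per time step
Patch = List[Frame]   # (2 * context + 1) stacked frames


def splice(frames: List[Frame], center: int, context: int) -> Patch:
    """Return frames[center - context .. center + context].

    Indices that fall outside the utterance are clamped, so the first or
    last frame is repeated at the edges (an assumed padding rule).
    """
    last = len(frames) - 1
    return [frames[min(max(t, 0), last)]
            for t in range(center - context, center + context + 1)]


def two_stream_patches(wide_feats: List[Frame],
                       narrow_feats: List[Frame],
                       center: int,
                       wide_context: int = 7,      # hypothetical value
                       narrow_context: int = 3     # hypothetical value
                       ) -> Tuple[Patch, Patch]:
    """Build the inputs for the two CNN streams at one time step."""
    if len(wide_feats) != len(narrow_feats):
        raise ValueError("the two feature sequences must be frame-aligned")
    return (splice(wide_feats, center, wide_context),
            splice(narrow_feats, center, narrow_context))


# Toy data: 20 frames, with made-up feature dimensions 4 and 2.
wide_feats = [[float(t)] * 4 for t in range(20)]
narrow_feats = [[float(t)] * 2 for t in range(20)]

wide_patch, narrow_patch = two_stream_patches(wide_feats, narrow_feats, center=10)
print(len(wide_patch), len(narrow_patch))  # 15 7
```

Each patch would then feed its own convolutional stream, with the stream outputs combined before the output layer. How the streams are merged is not stated in the abstract and is left out of the sketch.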
