2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): Latest Publications

Gated convolutional networks based hybrid acoustic models for low resource speech recognition
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268930
Jian Kang, Weiqiang Zhang, Jia Liu
{"title":"Gated convolutional networks based hybrid acoustic models for low resource speech recognition","authors":"Jian Kang, Weiqiang Zhang, Jia Liu","doi":"10.1109/ASRU.2017.8268930","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268930","url":null,"abstract":"In acoustic modeling for large vocabulary speech recognition, recurrent neural networks (RNN) have shown great abilities to model temporal dependencies. However, the performance of RNN is not prominent in resource limited tasks, even worse than the traditional feedforward neural networks (FNN). Furthermore, training time for RNN is much more than that for FNN. In recent years, some novel models are provided. They use non-recurrent architectures to model long term dependencies. In these architectures, they show that using gate mechanism is an effective method to construct acoustic models. On the other hand, it has been proved that using convolution operation is a good method to learn acoustic features. We hope to take advantages of both these two methods. In this paper we present a gated convolutional approach to low resource speech recognition tasks. The gated convolutional networks use convolutional architectures to learn input features and a gate to control information. Experiments are conducted on the OpenKWS, a series of low resource keyword search evaluations. From the results, the gated convolutional networks relatively decrease the WER about 6% over the baseline LSTM models, 5% over the DNN models and 3% over the BLSTM models. In addition, the new models accelerate the learning speed by more than 1.8 and 3.2 times compared to that of the baseline LSTM and BLSTM models.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123352432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
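As an illustration of the gate mechanism described in the abstract above, the following is a minimal PyTorch sketch of a gated 1-D convolutional block (a GLU-style unit) stacked into a frame-level acoustic model. The layer sizes, kernel width, and output senone count are placeholders, not the paper's configuration.

```python
# Minimal sketch of a gated 1-D convolutional acoustic model (GLU-style),
# assuming log-mel/MFCC input of shape (batch, feat_dim, time). All sizes
# below are illustrative.
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces both the linear output and the gate.
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size, padding=pad)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)          # the gate controls the information flow

class GatedConvAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_blocks=4, n_states=3000):
        super().__init__()
        layers = [GatedConvBlock(feat_dim, hidden)]
        layers += [GatedConvBlock(hidden, hidden) for _ in range(n_blocks - 1)]
        self.blocks = nn.Sequential(*layers)
        self.out = nn.Conv1d(hidden, n_states, kernel_size=1)  # per-frame senone scores

    def forward(self, x):                     # x: (batch, feat_dim, time)
        return self.out(self.blocks(x))       # (batch, n_states, time)

if __name__ == "__main__":
    model = GatedConvAcousticModel()
    frames = torch.randn(2, 40, 100)          # dummy batch: 2 utterances, 100 frames
    print(model(frames).shape)                # torch.Size([2, 3000, 100])
```

The convolution doubles the channel count so that one half can gate the other, which is the standard gated-linear-unit formulation; unlike an RNN, all frames are processed in parallel, which is where the training-speed advantage comes from.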
Deep learning methods for unsupervised acoustic modeling — Leap submission to ZeroSpeech challenge 2017
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8269013
T. Ansari, Rajath Kumar, Sonali Singh, Sriram Ganapathy
{"title":"Deep learning methods for unsupervised acoustic modeling — Leap submission to ZeroSpeech challenge 2017","authors":"T. Ansari, Rajath Kumar, Sonali Singh, Sriram Ganapathy","doi":"10.1109/ASRU.2017.8269013","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269013","url":null,"abstract":"In this paper, we present our system submission to the ZeroSpeech 2017 Challenge. The track1 of this challenge is intended to develop language independent speech representations that provide the least pairwise ABX distance computed for within speaker and across speaker pairs of spoken words. We investigate two approaches based on deep learning methods for unsupervised modeling. In the first approach, a deep neural network (DNN) is trained on the posteriors of mixture component indices obtained from training a Gaussian mixture model (GMM)-UBM. In the second approach, we develop a similar hidden Markov model (HMM) based DNN model to learn the unsupervised acoustic units provided by HMM state alignments. In addition, we also develop a deep autoencoder which learns language independent embeddings of speech to train the HMM-DNN model. Both the approaches do not use any labeled training data or require any supervision. We perform several experiments using the ZeroSpeech 2017 corpus with the minimal pair ABX error measure. In these experiments, we find that the two proposed approaches significantly improve over the baseline system using MFCC features (average relative improvements of 30–40%). Furthermore, the system combination of the two proposed approaches improves the performance over the best individual system.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132368206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
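A minimal sketch of the first approach described above (a DNN trained on GMM-UBM component posteriors), assuming MFCC-like frames and toy dimensions. The component count, layer sizes, and training loop are illustrative only; here the posteriors are collapsed to hard component labels, whereas soft targets from `predict_proba` would be closer to a posterior-based setup, and a bottleneck layer would normally supply the representation evaluated with the ABX measure.

```python
# Sketch: fit a GMM-UBM on unlabelled frames, then train a DNN to predict the
# mixture-component indices (hard labels derived from the posteriors).
import numpy as np
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture

feats = np.random.randn(5000, 39).astype(np.float32)      # stand-in for MFCC frames

gmm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=20)
gmm.fit(feats)
targets = gmm.predict(feats).astype(np.int64)              # component index per frame

dnn = nn.Sequential(nn.Linear(39, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 64))                     # 64 unsupervised "units"
opt = torch.optim.Adam(dnn.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.from_numpy(feats), torch.from_numpy(targets)
for _ in range(5):                                          # a few toy full-batch epochs
    opt.zero_grad()
    loss = loss_fn(dnn(x), y)
    loss.backward()
    opt.step()
```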
Denotation extraction for interactive learning in dialogue systems
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268976
Miroslav Vodolán, Filip Jurcícek
{"title":"Denotation extraction for interactive learning in dialogue systems","authors":"Miroslav Vodolán, Filip Jurcícek","doi":"10.1109/ASRU.2017.8268976","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268976","url":null,"abstract":"This paper presents a novel task using real user data obtained in human-machine conversation. The task concerns with denotation extraction from answer hints collected interactively in a dialogue. The task is motivated by the need for large amounts of training data for question answering dialogue system development, where the data is often expensive and hard to collect. Being able to collect denotation interactively and directly from users, one could improve, for example, natural understanding components on-line and ease the collection of the training data. This paper also presents introductory results of evaluation of several denotation extraction models including attention-based neural network approaches.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122570221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Incremental training and constructing the very deep convolutional residual network acoustic models
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268939
Sheng Li, Xugang Lu, Peng Shen, R. Takashima, Tatsuya Kawahara, H. Kawai
{"title":"Incremental training and constructing the very deep convolutional residual network acoustic models","authors":"Sheng Li, Xugang Lu, Peng Shen, R. Takashima, Tatsuya Kawahara, H. Kawai","doi":"10.1109/ASRU.2017.8268939","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268939","url":null,"abstract":"Inspired by the successful applications in image recognition, the very deep convolutional residual network (ResNet) based model has been applied in automatic speech recognition (ASR). However, the computational load is heavy for training the ResNet with a large quantity of data. In this paper, we propose an incremental model training framework to accelerate the training process of the ResNet. The incremental model training framework is based on the unequal importance of each layer and connection in the ResNet. The modules with important layers and connections are regarded as a skeleton model, while those left are regarded as an auxiliary model. The total depth of the skeleton model is quite shallow compared to the very deep full network. In our incremental training, the skeleton model is first trained with the full training data set. Other layers and connections belonging to the auxiliary model are gradually attached to the skeleton model and tuned. Our experiments showed that the proposed incremental training obtained comparable performances and faster training speed compared with the model training as a whole without consideration of the different importance of each layer.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124639980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
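The following is a minimal sketch of the incremental idea described above, assuming for brevity a plain fully connected residual stack rather than the paper's very deep convolutional ResNet, and omitting how the important layers are selected: a shallow skeleton is trained first, and auxiliary residual blocks are then attached and tuned.

```python
# Sketch of skeleton-then-auxiliary incremental training for a residual stack.
# Architecture, sizes, and schedule are placeholders, not the paper's setup.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, x):
        return torch.relu(x + self.body(x))        # identity shortcut

class GrowingResNet(nn.Module):
    def __init__(self, feat_dim=40, dim=128, n_out=3000, n_skeleton=3):
        super().__init__()
        self.inp = nn.Linear(feat_dim, dim)
        self.blocks = nn.ModuleList([ResBlock(dim) for _ in range(n_skeleton)])
        self.out = nn.Linear(dim, n_out)

    def add_auxiliary_block(self, dim=128):
        self.blocks.append(ResBlock(dim))           # attach a new block to the stack

    def forward(self, x):
        h = torch.relu(self.inp(x))
        for blk in self.blocks:
            h = blk(h)
        return self.out(h)

model = GrowingResNet()
# ... train the shallow skeleton on the full data here ...
model.add_auxiliary_block()                          # grow the network
# ... continue training, optionally with a smaller learning rate for old blocks ...
print(model(torch.randn(4, 40)).shape)               # torch.Size([4, 3000])
```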
Learning modality-invariant representations for speech and images
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268967
K. Leidal, David F. Harwath, James R. Glass
{"title":"Learning modality-invariant representations for speech and images","authors":"K. Leidal, David F. Harwath, James R. Glass","doi":"10.1109/ASRU.2017.8268967","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268967","url":null,"abstract":"In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs. Specifically, we focus on the task of learning a semantic vector space for both spoken and handwritten digits using the TIDIGITs and MNIST datasets. Current techniques encode image and audio/textual inputs directly to semantic embeddings. In contrast, our technique maps an input to the mean and log variance vectors of a diagonal Gaussian from which sample semantic embeddings are drawn. In addition to encouraging semantic similarity between co-occurring inputs, our loss function includes a regularization term borrowed from variational autoencoders (VAEs) which drives the posterior distributions over embeddings to be unit Gaussian. We can use this regularization term to filter out modality information while preserving semantic information. We speculate this technique may be more broadly applicable to other areas of cross-modality/domain information retrieval and transfer learning.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121677762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
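A minimal sketch of the objective described above, assuming toy input dimensions and a simple mean-squared similarity term: each modality encoder outputs the mean and log-variance of a diagonal Gaussian, embeddings are drawn with the reparameterisation trick, and a VAE-style KL term pushes the posteriors towards a unit Gaussian. The KL weight, encoder architectures, and similarity term are placeholders.

```python
# Sketch of modality-invariant Gaussian embeddings with a VAE-style KL penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEncoder(nn.Module):
    def __init__(self, in_dim, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, emb_dim)
        self.logvar = nn.Linear(256, emb_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def sample(mu, logvar):
    # Reparameterisation trick: z = mu + sigma * eps
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def kl_to_unit_gaussian(mu, logvar):
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

audio_enc, image_enc = GaussianEncoder(512), GaussianEncoder(784)
audio, image = torch.randn(8, 512), torch.randn(8, 784)     # co-occurring pairs

mu_a, lv_a = audio_enc(audio)
mu_i, lv_i = image_enc(image)
z_a, z_i = sample(mu_a, lv_a), sample(mu_i, lv_i)

similarity_loss = F.mse_loss(z_a, z_i)                       # pull co-occurring pairs together
loss = similarity_loss + 0.1 * (kl_to_unit_gaussian(mu_a, lv_a)
                                + kl_to_unit_gaussian(mu_i, lv_i))
loss.backward()                                              # trainable end-to-end
```

The KL term is what squeezes out modality-specific information: both posteriors are pushed towards the same unit Gaussian, so only the variation needed to keep co-occurring pairs close (the semantic content) survives.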
Subband WaveNet with overlapped single-sideband filterbanks
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8269005
T. Okamoto, Kentaro Tachibana, T. Toda, Y. Shiga, H. Kawai
{"title":"Subband wavenet with overlapped single-sideband filterbanks","authors":"T. Okamoto, Kentaro Tachibana, T. Toda, Y. Shiga, H. Kawai","doi":"10.1109/ASRU.2017.8269005","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269005","url":null,"abstract":"Compared with conventional vocoders, deep neural network-based raw audio generative models, such as WaveNet and SampleRNN, can more naturally synthesize speech signals, although the synthesis speed is a problem, especially with high sampling frequency. This paper provides subband WaveNet based on multirate signal processing for high-speed and high-quality synthesis with raw audio generative models. In the training stage, speech waveforms are decomposed and decimated into subband short waveforms with a low sampling rate, and each subband WaveNet network is trained using each subband stream. In the synthesis stage, each generated signal is up-sampled and integrated into a fullband speech signal. The results of objective and subjective experiments for unconditional WaveNet with a sampling frequency of 32 kHz indicate that the proposed subband WaveNet with a square-root Hann window-based overlapped 9-channel single-sideband filterbank can realize about four times the synthesis speed and improve the synthesized speech quality more than the conventional fullband WaveNet.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132721158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Grounded language understanding for manipulation instructions using GAN-based classification
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268980
K. Sugiura, H. Kawai
{"title":"Grounded language understanding for manipulation instructions using GAN-based classification","authors":"K. Sugiura, H. Kawai","doi":"10.1109/ASRU.2017.8268980","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268980","url":null,"abstract":"The target task of this study is grounded language understanding for domestic service robots (DSRs). In particular, we focus on instruction understanding for short sentences where verbs are missing. This task is of critical importance to build communicative DSRs because manipulation is essential for DSRs. Existing instruction understanding methods usually estimate missing information only from non-grounded knowledge; therefore, whether the predicted action is physically executable or not was unclear. In this paper, we present a grounded instruction understanding method to estimate appropriate objects given an instruction and situation. We extend the Generative Adversarial Nets (GAN) and build a GAN-based classifier using latent representations. To quantitatively evaluate the proposed method, we have developed a data set based on the standard data set used for visual question answering (VQA). Experimental results have shown that the proposed method gives the better result than baseline methods.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"166 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117053735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
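For concreteness, here is a hedged sketch of one common GAN-based classification recipe over latent feature vectors (a K+1-class discriminator, where the extra class marks generated samples); the paper's exact extension of GAN may differ, and the feature dimension, class count, and training schedule below are placeholders.

```python
# Sketch of a K+1-class GAN-style classifier: real latent features are
# classified into K target classes, generated features into an extra "fake"
# class; the generator tries to make its outputs pass as real classes.
import torch
import torch.nn as nn

K, FEAT = 10, 128                                   # object classes, latent feature size

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, FEAT))     # noise -> fake feature
D = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(), nn.Linear(256, K + 1))  # feature -> K classes + "fake"

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
ce = nn.CrossEntropyLoss()

real_feats = torch.randn(32, FEAT)                  # stand-in for latent representations
real_labels = torch.randint(0, K, (32,))            # stand-in for target-object labels

for _ in range(3):                                  # toy training steps
    # Classifier step: real features get their true class, generated ones class K.
    fake_feats = G(torch.randn(32, 64)).detach()
    d_loss = ce(D(real_feats), real_labels) + ce(D(fake_feats), torch.full((32,), K))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make generated features look like some real class.
    fake_feats = G(torch.randn(32, 64))
    g_loss = ce(D(fake_feats), torch.randint(0, K, (32,)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```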
Tackling unseen acoustic conditions in query-by-example search using time and frequency convolution for multilingual deep bottleneck features
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268915
Julien van Hout, V. Mitra, H. Franco, C. Bartels, D. Vergyri
{"title":"Tackling unseen acoustic conditions in query-by-example search using time and frequency convolution for multilingual deep bottleneck features","authors":"Julien van Hout, V. Mitra, H. Franco, C. Bartels, D. Vergyri","doi":"10.1109/ASRU.2017.8268915","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268915","url":null,"abstract":"Standard keyword spotting based on Automatic Speech Recognition (ASR) cannot be used on low-and no-resource languages due to lack of annotated data and/or linguistic resources. In recent years, query-by-example (QbE) has emerged as an alternate way to enroll and find spoken queries in large audio corpora, yet mismatched and unseen acoustic conditions remain a difficult challenge given the lack of enrollment data. This paper revisits two neural network architectures developed for noise and channel-robust ASR, and applies them to building a state-of-art multilingual QbE system. By applying convolution in time or frequency across the spectrum, those convolutional bottlenecks learn more discriminative deep bottleneck features. In conjunction with dynamic time warping (DTW), these features enable robust QbE systems. We use the MediaEval 2014 QUESST data to evaluate robustness against language and channel mismatches, and add several levels of artificial noise to the data to evaluate performance in degraded acoustic environments. We also assess performance on an Air Traffic Control QbE task with more realistic and higher levels of distortion in the push-to-talk domain.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117341082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
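Since the abstract pairs the bottleneck features with dynamic time warping, here is a minimal DTW sketch over bottleneck feature sequences; the cosine distance, step pattern, and length normalisation are illustrative choices rather than the paper's exact matching back end.

```python
# Sketch: DTW matching between a spoken query and a search segment, both
# represented as sequences of bottleneck feature vectors.
import numpy as np

def dtw_distance(query, segment):
    """query: (Tq, D), segment: (Ts, D) arrays of bottleneck features."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    s = segment / (np.linalg.norm(segment, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - q @ s.T                            # cosine distance matrix (Tq, Ts)

    Tq, Ts = dist.shape
    acc = np.full((Tq + 1, Ts + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Ts + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],       # insertion
                                                 acc[i, j - 1],       # deletion
                                                 acc[i - 1, j - 1])   # match
    return acc[Tq, Ts] / (Tq + Ts)                  # length-normalised alignment cost

query = np.random.randn(40, 80)                     # 40 frames of 80-dim bottlenecks
segment = np.random.randn(120, 80)
print(dtw_distance(query, segment))                 # lower score = better match
```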
Error detection of grapheme-to-phoneme conversion in text-to-speech synthesis using speech signal and lexical context
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8269004
Kevin Vythelingum, Y. Estève, O. Rosec
{"title":"Error detection of grapheme-to-phoneme conversion in text-to-speech synthesis using speech signal and lexical context","authors":"Kevin Vythelingum, Y. Estève, O. Rosec","doi":"10.1109/ASRU.2017.8269004","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269004","url":null,"abstract":"In unit selection text-to-speech synthesis, voice creation involved a phonemic transcription of read speech. This is produced by an automatic grapheme-to-phoneme conversion of the text read, followed by a manual correction. Although grapheme-to-phoneme conversion makes few errors, the manual correction is time consuming as every generated phoneme should be checked. We propose a method to automatically detect grapheme-to-phoneme conversion errors by comparing contrastives phonemisation hypothesis. A lattice-based forced alignment system is implemented, allowing for signal-dependent phonemisation. We implement also a sequence-to-sequence neural network model to obtain a context-dependent grapheme-to-phoneme conversion. On a French dataset, we show that we can detect to 86.3% of the errors made by a commercial grapheme-to-phoneme system. Moreover, the amount of data annotated as erroneous is kept under 10% of the total evaluation data. The time spent for phoneme manual checking can thus been drastically reduced without decreasing significantly the phonemic transcription quality.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115626251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
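A minimal sketch of the comparison idea, assuming the G2P output and a contrastive signal-derived hypothesis are already available as phoneme sequences: a plain Levenshtein alignment flags the positions where the two disagree as candidates for manual checking. The phoneme sequences below are made up for illustration.

```python
# Sketch: align two phonemisation hypotheses and flag disagreements.
def align(ref, hyp):
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (ref[i-1] != hyp[j-1]))
    # Backtrace to recover aligned phoneme pairs (None marks an insertion/deletion).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            pairs.append((ref[i-1], hyp[j-1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            pairs.append((ref[i-1], None)); i -= 1
        else:
            pairs.append((None, hyp[j-1])); j -= 1
    return list(reversed(pairs))

g2p_hyp    = ["b", "o~", "z", "u", "r"]       # hypothetical G2P output
signal_hyp = ["b", "o~", "Z", "u", "r"]       # hypothetical signal-based output
suspicious = [(k, a, b) for k, (a, b) in enumerate(align(g2p_hyp, signal_hyp)) if a != b]
print(suspicious)                              # positions to send for manual checking
```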
Learning speaker representation for neural network based multichannel speaker extraction
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268910
Kateřina Žmolíková, Marc Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani
{"title":"Learning speaker representation for neural network based multichannel speaker extraction","authors":"Kateřina Žmolíková, Marc Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani","doi":"10.1109/ASRU.2017.8268910","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268910","url":null,"abstract":"Recently, schemes employing deep neural networks (DNNs) for extracting speech from noisy observation have demonstrated great potential for noise robust automatic speech recognition. However, these schemes are not well suited when the interfering noise is another speaker. To enable extracting a target speaker from a mixture of speakers, we have recently proposed to inform the neural network using speaker information extracted from an adaptation utterance from the same speaker. In our previous work, we explored ways how to inform the network about the speaker and found a speaker adaptive layer approach to be suitable for this task. In our experiments, we used speaker features designed for speaker recognition tasks as the additional speaker information, which may not be optimal for the speaker extraction task. In this paper, we propose a usage of a sequence summarizing scheme enabling to learn the speaker representation jointly with the network. Furthermore, we extend the previous experiments to demonstrate the potential of our proposed method as a front-end for speech recognition and explore the effect of additional noise on the performance of the method.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124501345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
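A minimal sketch of the jointly learned sequence summary, assuming magnitude-spectrogram features and single-channel mask estimation: a small network embeds each frame of the adaptation utterance, the frame embeddings are averaged into a speaker vector, and that vector multiplicatively adapts one hidden layer of the extraction network. The paper's system is multichannel; this sketch only illustrates the speaker-adaptive layer and summary network, and all shapes and layer sizes are placeholders.

```python
# Sketch: sequence-summarizing speaker network feeding a speaker-adaptive layer.
import torch
import torch.nn as nn

class SequenceSummary(nn.Module):
    def __init__(self, feat_dim=257, emb_dim=128):
        super().__init__()
        self.frame_net = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                       nn.Linear(emb_dim, emb_dim))

    def forward(self, adapt_utt):                    # (frames, feat_dim)
        return self.frame_net(adapt_utt).mean(dim=0) # average into one speaker vector

class SpeakerExtractor(nn.Module):
    def __init__(self, feat_dim=257, hidden=128):
        super().__init__()
        self.summary = SequenceSummary(feat_dim, hidden)
        self.pre = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.post = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, feat_dim), nn.Sigmoid())

    def forward(self, mixture, adapt_utt):           # (T, feat_dim), (Ta, feat_dim)
        spk = self.summary(adapt_utt)                # speaker vector, learned jointly
        h = self.pre(mixture) * spk                  # speaker-adaptive layer (elementwise)
        return self.post(h)                          # time-frequency mask for the target

net = SpeakerExtractor()
mask = net(torch.rand(200, 257), torch.rand(150, 257))
print(mask.shape)                                    # torch.Size([200, 257])
```

Because the summary network is trained jointly with the extraction network, the speaker vector is optimised for the extraction objective itself rather than for speaker recognition, which is the point made in the abstract.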