2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): Latest Publications

Gated convolutional networks based hybrid acoustic models for low resource speech recognition
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268930
Jian Kang, Weiqiang Zhang, Jia Liu
{"title":"Gated convolutional networks based hybrid acoustic models for low resource speech recognition","authors":"Jian Kang, Weiqiang Zhang, Jia Liu","doi":"10.1109/ASRU.2017.8268930","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268930","url":null,"abstract":"In acoustic modeling for large vocabulary speech recognition, recurrent neural networks (RNN) have shown great abilities to model temporal dependencies. However, the performance of RNN is not prominent in resource limited tasks, even worse than the traditional feedforward neural networks (FNN). Furthermore, training time for RNN is much more than that for FNN. In recent years, some novel models are provided. They use non-recurrent architectures to model long term dependencies. In these architectures, they show that using gate mechanism is an effective method to construct acoustic models. On the other hand, it has been proved that using convolution operation is a good method to learn acoustic features. We hope to take advantages of both these two methods. In this paper we present a gated convolutional approach to low resource speech recognition tasks. The gated convolutional networks use convolutional architectures to learn input features and a gate to control information. Experiments are conducted on the OpenKWS, a series of low resource keyword search evaluations. From the results, the gated convolutional networks relatively decrease the WER about 6% over the baseline LSTM models, 5% over the DNN models and 3% over the BLSTM models. In addition, the new models accelerate the learning speed by more than 1.8 and 3.2 times compared to that of the baseline LSTM and BLSTM models.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123352432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
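As an illustration of the gate mechanism described in the abstract above, the following is a minimal PyTorch sketch of a gated 1-D convolutional block (a GLU-style unit) stacked into a frame-level acoustic model. The layer sizes, kernel width, and output senone count are placeholders, not the paper's configuration.

```python
# Minimal sketch of a gated 1-D convolutional acoustic model (GLU-style),
# assuming log-mel/MFCC input of shape (batch, feat_dim, time). All sizes
# below are illustrative.
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces both the linear output and the gate.
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size, padding=pad)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)          # the gate controls the information flow

class GatedConvAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_blocks=4, n_states=3000):
        super().__init__()
        layers = [GatedConvBlock(feat_dim, hidden)]
        layers += [GatedConvBlock(hidden, hidden) for _ in range(n_blocks - 1)]
        self.blocks = nn.Sequential(*layers)
        self.out = nn.Conv1d(hidden, n_states, kernel_size=1)  # per-frame senone scores

    def forward(self, x):                     # x: (batch, feat_dim, time)
        return self.out(self.blocks(x))       # (batch, n_states, time)

if __name__ == "__main__":
    model = GatedConvAcousticModel()
    frames = torch.randn(2, 40, 100)          # dummy batch: 2 utterances, 100 frames
    print(model(frames).shape)                # torch.Size([2, 3000, 100])
```

The convolution doubles the channel count so that one half can gate the other, which is the standard gated-linear-unit formulation; unlike an RNN, all frames are processed in parallel, which is where the training-speed advantage comes from.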
Deep learning methods for unsupervised acoustic modeling — Leap submission to ZeroSpeech challenge 2017
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8269013
T. Ansari, Rajath Kumar, Sonali Singh, Sriram Ganapathy
{"title":"Deep learning methods for unsupervised acoustic modeling — Leap submission to ZeroSpeech challenge 2017","authors":"T. Ansari, Rajath Kumar, Sonali Singh, Sriram Ganapathy","doi":"10.1109/ASRU.2017.8269013","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269013","url":null,"abstract":"In this paper, we present our system submission to the ZeroSpeech 2017 Challenge. The track1 of this challenge is intended to develop language independent speech representations that provide the least pairwise ABX distance computed for within speaker and across speaker pairs of spoken words. We investigate two approaches based on deep learning methods for unsupervised modeling. In the first approach, a deep neural network (DNN) is trained on the posteriors of mixture component indices obtained from training a Gaussian mixture model (GMM)-UBM. In the second approach, we develop a similar hidden Markov model (HMM) based DNN model to learn the unsupervised acoustic units provided by HMM state alignments. In addition, we also develop a deep autoencoder which learns language independent embeddings of speech to train the HMM-DNN model. Both the approaches do not use any labeled training data or require any supervision. We perform several experiments using the ZeroSpeech 2017 corpus with the minimal pair ABX error measure. In these experiments, we find that the two proposed approaches significantly improve over the baseline system using MFCC features (average relative improvements of 30–40%). Furthermore, the system combination of the two proposed approaches improves the performance over the best individual system.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132368206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
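A minimal sketch of the first approach described above (a DNN trained on GMM-UBM component posteriors), assuming MFCC-like frames and toy dimensions. The component count, layer sizes, and training loop are illustrative only; here the posteriors are collapsed to hard component labels, whereas soft targets from `predict_proba` would be closer to a posterior-based setup, and a bottleneck layer would normally supply the representation evaluated with the ABX measure.

```python
# Sketch: fit a GMM-UBM on unlabelled frames, then train a DNN to predict the
# mixture-component indices (hard labels derived from the posteriors).
import numpy as np
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture

feats = np.random.randn(5000, 39).astype(np.float32)      # stand-in for MFCC frames

gmm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=20)
gmm.fit(feats)
targets = gmm.predict(feats).astype(np.int64)              # component index per frame

dnn = nn.Sequential(nn.Linear(39, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 64))                     # 64 unsupervised "units"
opt = torch.optim.Adam(dnn.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.from_numpy(feats), torch.from_numpy(targets)
for _ in range(5):                                          # a few toy full-batch epochs
    opt.zero_grad()
    loss = loss_fn(dnn(x), y)
    loss.backward()
    opt.step()
```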
Denotation extraction for interactive learning in dialogue systems
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268976
Miroslav Vodolán, Filip Jurcícek
{"title":"Denotation extraction for interactive learning in dialogue systems","authors":"Miroslav Vodolán, Filip Jurcícek","doi":"10.1109/ASRU.2017.8268976","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268976","url":null,"abstract":"This paper presents a novel task using real user data obtained in human-machine conversation. The task concerns with denotation extraction from answer hints collected interactively in a dialogue. The task is motivated by the need for large amounts of training data for question answering dialogue system development, where the data is often expensive and hard to collect. Being able to collect denotation interactively and directly from users, one could improve, for example, natural understanding components on-line and ease the collection of the training data. This paper also presents introductory results of evaluation of several denotation extraction models including attention-based neural network approaches.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122570221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Incremental training and constructing the very deep convolutional residual network acoustic models
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268939
Sheng Li, Xugang Lu, Peng Shen, R. Takashima, Tatsuya Kawahara, H. Kawai
{"title":"Incremental training and constructing the very deep convolutional residual network acoustic models","authors":"Sheng Li, Xugang Lu, Peng Shen, R. Takashima, Tatsuya Kawahara, H. Kawai","doi":"10.1109/ASRU.2017.8268939","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268939","url":null,"abstract":"Inspired by the successful applications in image recognition, the very deep convolutional residual network (ResNet) based model has been applied in automatic speech recognition (ASR). However, the computational load is heavy for training the ResNet with a large quantity of data. In this paper, we propose an incremental model training framework to accelerate the training process of the ResNet. The incremental model training framework is based on the unequal importance of each layer and connection in the ResNet. The modules with important layers and connections are regarded as a skeleton model, while those left are regarded as an auxiliary model. The total depth of the skeleton model is quite shallow compared to the very deep full network. In our incremental training, the skeleton model is first trained with the full training data set. Other layers and connections belonging to the auxiliary model are gradually attached to the skeleton model and tuned. Our experiments showed that the proposed incremental training obtained comparable performances and faster training speed compared with the model training as a whole without consideration of the different importance of each layer.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124639980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
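The following is a minimal sketch of the incremental idea described above, assuming for brevity a plain fully connected residual stack rather than the paper's very deep convolutional ResNet, and omitting how the important layers are selected: a shallow skeleton is trained first, and auxiliary residual blocks are then attached and tuned.

```python
# Sketch of skeleton-then-auxiliary incremental training for a residual stack.
# Architecture, sizes, and schedule are placeholders, not the paper's setup.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, x):
        return torch.relu(x + self.body(x))        # identity shortcut

class GrowingResNet(nn.Module):
    def __init__(self, feat_dim=40, dim=128, n_out=3000, n_skeleton=3):
        super().__init__()
        self.inp = nn.Linear(feat_dim, dim)
        self.blocks = nn.ModuleList([ResBlock(dim) for _ in range(n_skeleton)])
        self.out = nn.Linear(dim, n_out)

    def add_auxiliary_block(self, dim=128):
        self.blocks.append(ResBlock(dim))           # attach a new block to the stack

    def forward(self, x):
        h = torch.relu(self.inp(x))
        for blk in self.blocks:
            h = blk(h)
        return self.out(h)

model = GrowingResNet()
# ... train the shallow skeleton on the full data here ...
model.add_auxiliary_block()                          # grow the network
# ... continue training, optionally with a smaller learning rate for old blocks ...
print(model(torch.randn(4, 40)).shape)               # torch.Size([4, 3000])
```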
Learning modality-invariant representations for speech and images
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268967
K. Leidal, David F. Harwath, James R. Glass
{"title":"Learning modality-invariant representations for speech and images","authors":"K. Leidal, David F. Harwath, James R. Glass","doi":"10.1109/ASRU.2017.8268967","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268967","url":null,"abstract":"In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs. Specifically, we focus on the task of learning a semantic vector space for both spoken and handwritten digits using the TIDIGITs and MNIST datasets. Current techniques encode image and audio/textual inputs directly to semantic embeddings. In contrast, our technique maps an input to the mean and log variance vectors of a diagonal Gaussian from which sample semantic embeddings are drawn. In addition to encouraging semantic similarity between co-occurring inputs, our loss function includes a regularization term borrowed from variational autoencoders (VAEs) which drives the posterior distributions over embeddings to be unit Gaussian. We can use this regularization term to filter out modality information while preserving semantic information. We speculate this technique may be more broadly applicable to other areas of cross-modality/domain information retrieval and transfer learning.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121677762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
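A minimal sketch of the objective described above, assuming toy input dimensions and a simple mean-squared similarity term: each modality encoder outputs the mean and log-variance of a diagonal Gaussian, embeddings are drawn with the reparameterisation trick, and a VAE-style KL term pushes the posteriors towards a unit Gaussian. The KL weight, encoder architectures, and similarity term are placeholders.

```python
# Sketch of modality-invariant Gaussian embeddings with a VAE-style KL penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEncoder(nn.Module):
    def __init__(self, in_dim, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, emb_dim)
        self.logvar = nn.Linear(256, emb_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def sample(mu, logvar):
    # Reparameterisation trick: z = mu + sigma * eps
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def kl_to_unit_gaussian(mu, logvar):
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

audio_enc, image_enc = GaussianEncoder(512), GaussianEncoder(784)
audio, image = torch.randn(8, 512), torch.randn(8, 784)     # co-occurring pairs

mu_a, lv_a = audio_enc(audio)
mu_i, lv_i = image_enc(image)
z_a, z_i = sample(mu_a, lv_a), sample(mu_i, lv_i)

similarity_loss = F.mse_loss(z_a, z_i)                       # pull co-occurring pairs together
loss = similarity_loss + 0.1 * (kl_to_unit_gaussian(mu_a, lv_a)
                                + kl_to_unit_gaussian(mu_i, lv_i))
loss.backward()                                              # trainable end-to-end
```

The KL term is what squeezes out modality-specific information: both posteriors are pushed towards the same unit Gaussian, so only the variation needed to keep co-occurring pairs close (the semantic content) survives.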
Subband WaveNet with overlapped single-sideband filterbanks
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8269005
T. Okamoto, Kentaro Tachibana, T. Toda, Y. Shiga, H. Kawai
{"title":"Subband wavenet with overlapped single-sideband filterbanks","authors":"T. Okamoto, Kentaro Tachibana, T. Toda, Y. Shiga, H. Kawai","doi":"10.1109/ASRU.2017.8269005","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269005","url":null,"abstract":"Compared with conventional vocoders, deep neural network-based raw audio generative models, such as WaveNet and SampleRNN, can more naturally synthesize speech signals, although the synthesis speed is a problem, especially with high sampling frequency. This paper provides subband WaveNet based on multirate signal processing for high-speed and high-quality synthesis with raw audio generative models. In the training stage, speech waveforms are decomposed and decimated into subband short waveforms with a low sampling rate, and each subband WaveNet network is trained using each subband stream. In the synthesis stage, each generated signal is up-sampled and integrated into a fullband speech signal. The results of objective and subjective experiments for unconditional WaveNet with a sampling frequency of 32 kHz indicate that the proposed subband WaveNet with a square-root Hann window-based overlapped 9-channel single-sideband filterbank can realize about four times the synthesis speed and improve the synthesized speech quality more than the conventional fullband WaveNet.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132721158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Grounded language understanding for manipulation instructions using GAN-based classification
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268980
K. Sugiura, H. Kawai
{"title":"Grounded language understanding for manipulation instructions using GAN-based classification","authors":"K. Sugiura, H. Kawai","doi":"10.1109/ASRU.2017.8268980","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268980","url":null,"abstract":"The target task of this study is grounded language understanding for domestic service robots (DSRs). In particular, we focus on instruction understanding for short sentences where verbs are missing. This task is of critical importance to build communicative DSRs because manipulation is essential for DSRs. Existing instruction understanding methods usually estimate missing information only from non-grounded knowledge; therefore, whether the predicted action is physically executable or not was unclear. In this paper, we present a grounded instruction understanding method to estimate appropriate objects given an instruction and situation. We extend the Generative Adversarial Nets (GAN) and build a GAN-based classifier using latent representations. To quantitatively evaluate the proposed method, we have developed a data set based on the standard data set used for visual question answering (VQA). Experimental results have shown that the proposed method gives the better result than baseline methods.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"166 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117053735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
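For concreteness, here is a hedged sketch of one common GAN-based classification recipe over latent feature vectors (a K+1-class discriminator, where the extra class marks generated samples); the paper's exact extension of GAN may differ, and the feature dimension, class count, and training schedule below are placeholders.

```python
# Sketch of a K+1-class GAN-style classifier: real latent features are
# classified into K target classes, generated features into an extra "fake"
# class; the generator tries to make its outputs pass as real classes.
import torch
import torch.nn as nn

K, FEAT = 10, 128                                   # object classes, latent feature size

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, FEAT))     # noise -> fake feature
D = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(), nn.Linear(256, K + 1))  # feature -> K classes + "fake"

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
ce = nn.CrossEntropyLoss()

real_feats = torch.randn(32, FEAT)                  # stand-in for latent representations
real_labels = torch.randint(0, K, (32,))            # stand-in for target-object labels

for _ in range(3):                                  # toy training steps
    # Classifier step: real features get their true class, generated ones class K.
    fake_feats = G(torch.randn(32, 64)).detach()
    d_loss = ce(D(real_feats), real_labels) + ce(D(fake_feats), torch.full((32,), K))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make generated features look like some real class.
    fake_feats = G(torch.randn(32, 64))
    g_loss = ce(D(fake_feats), torch.randint(0, K, (32,)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```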
Tackling unseen acoustic conditions in query-by-example search using time and frequency convolution for multilingual deep bottleneck features
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268915
Julien van Hout, V. Mitra, H. Franco, C. Bartels, D. Vergyri
{"title":"Tackling unseen acoustic conditions in query-by-example search using time and frequency convolution for multilingual deep bottleneck features","authors":"Julien van Hout, V. Mitra, H. Franco, C. Bartels, D. Vergyri","doi":"10.1109/ASRU.2017.8268915","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268915","url":null,"abstract":"Standard keyword spotting based on Automatic Speech Recognition (ASR) cannot be used on low-and no-resource languages due to lack of annotated data and/or linguistic resources. In recent years, query-by-example (QbE) has emerged as an alternate way to enroll and find spoken queries in large audio corpora, yet mismatched and unseen acoustic conditions remain a difficult challenge given the lack of enrollment data. This paper revisits two neural network architectures developed for noise and channel-robust ASR, and applies them to building a state-of-art multilingual QbE system. By applying convolution in time or frequency across the spectrum, those convolutional bottlenecks learn more discriminative deep bottleneck features. In conjunction with dynamic time warping (DTW), these features enable robust QbE systems. We use the MediaEval 2014 QUESST data to evaluate robustness against language and channel mismatches, and add several levels of artificial noise to the data to evaluate performance in degraded acoustic environments. We also assess performance on an Air Traffic Control QbE task with more realistic and higher levels of distortion in the push-to-talk domain.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117341082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
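Since the abstract pairs the bottleneck features with dynamic time warping, here is a minimal DTW sketch over bottleneck feature sequences; the cosine distance, step pattern, and length normalisation are illustrative choices rather than the paper's exact matching back end.

```python
# Sketch: DTW matching between a spoken query and a search segment, both
# represented as sequences of bottleneck feature vectors.
import numpy as np

def dtw_distance(query, segment):
    """query: (Tq, D), segment: (Ts, D) arrays of bottleneck features."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    s = segment / (np.linalg.norm(segment, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - q @ s.T                            # cosine distance matrix (Tq, Ts)

    Tq, Ts = dist.shape
    acc = np.full((Tq + 1, Ts + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Ts + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],       # insertion
                                                 acc[i, j - 1],       # deletion
                                                 acc[i - 1, j - 1])   # match
    return acc[Tq, Ts] / (Tq + Ts)                  # length-normalised alignment cost

query = np.random.randn(40, 80)                     # 40 frames of 80-dim bottlenecks
segment = np.random.randn(120, 80)
print(dtw_distance(query, segment))                 # lower score = better match
```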
Error detection of grapheme-to-phoneme conversion in text-to-speech synthesis using speech signal and lexical context
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8269004
Kevin Vythelingum, Y. Estève, O. Rosec
{"title":"Error detection of grapheme-to-phoneme conversion in text-to-speech synthesis using speech signal and lexical context","authors":"Kevin Vythelingum, Y. Estève, O. Rosec","doi":"10.1109/ASRU.2017.8269004","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269004","url":null,"abstract":"In unit selection text-to-speech synthesis, voice creation involved a phonemic transcription of read speech. This is produced by an automatic grapheme-to-phoneme conversion of the text read, followed by a manual correction. Although grapheme-to-phoneme conversion makes few errors, the manual correction is time consuming as every generated phoneme should be checked. We propose a method to automatically detect grapheme-to-phoneme conversion errors by comparing contrastives phonemisation hypothesis. A lattice-based forced alignment system is implemented, allowing for signal-dependent phonemisation. We implement also a sequence-to-sequence neural network model to obtain a context-dependent grapheme-to-phoneme conversion. On a French dataset, we show that we can detect to 86.3% of the errors made by a commercial grapheme-to-phoneme system. Moreover, the amount of data annotated as erroneous is kept under 10% of the total evaluation data. The time spent for phoneme manual checking can thus been drastically reduced without decreasing significantly the phonemic transcription quality.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115626251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
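A minimal sketch of the comparison idea, assuming the G2P output and a contrastive signal-derived hypothesis are already available as phoneme sequences: a plain Levenshtein alignment flags the positions where the two disagree as candidates for manual checking. The phoneme sequences below are made up for illustration.

```python
# Sketch: align two phonemisation hypotheses and flag disagreements.
def align(ref, hyp):
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (ref[i-1] != hyp[j-1]))
    # Backtrace to recover aligned phoneme pairs (None marks an insertion/deletion).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            pairs.append((ref[i-1], hyp[j-1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            pairs.append((ref[i-1], None)); i -= 1
        else:
            pairs.append((None, hyp[j-1])); j -= 1
    return list(reversed(pairs))

g2p_hyp    = ["b", "o~", "z", "u", "r"]       # hypothetical G2P output
signal_hyp = ["b", "o~", "Z", "u", "r"]       # hypothetical signal-based output
suspicious = [(k, a, b) for k, (a, b) in enumerate(align(g2p_hyp, signal_hyp)) if a != b]
print(suspicious)                              # positions to send for manual checking
```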
Learning speaker representation for neural network based multichannel speaker extraction
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date : 2017-12-01 DOI: 10.1109/ASRU.2017.8268910
Kateřina Žmolíková, Marc Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani
{"title":"Learning speaker representation for neural network based multichannel speaker extraction","authors":"Kateřina Žmolíková, Marc Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani","doi":"10.1109/ASRU.2017.8268910","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268910","url":null,"abstract":"Recently, schemes employing deep neural networks (DNNs) for extracting speech from noisy observation have demonstrated great potential for noise robust automatic speech recognition. However, these schemes are not well suited when the interfering noise is another speaker. To enable extracting a target speaker from a mixture of speakers, we have recently proposed to inform the neural network using speaker information extracted from an adaptation utterance from the same speaker. In our previous work, we explored ways how to inform the network about the speaker and found a speaker adaptive layer approach to be suitable for this task. In our experiments, we used speaker features designed for speaker recognition tasks as the additional speaker information, which may not be optimal for the speaker extraction task. In this paper, we propose a usage of a sequence summarizing scheme enabling to learn the speaker representation jointly with the network. Furthermore, we extend the previous experiments to demonstrate the potential of our proposed method as a front-end for speech recognition and explore the effect of additional noise on the performance of the method.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124501345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
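A minimal sketch of the jointly learned sequence summary, assuming magnitude-spectrogram features and single-channel mask estimation: a small network embeds each frame of the adaptation utterance, the frame embeddings are averaged into a speaker vector, and that vector multiplicatively adapts one hidden layer of the extraction network. The paper's system is multichannel; this sketch only illustrates the speaker-adaptive layer and summary network, and all shapes and layer sizes are placeholders.

```python
# Sketch: sequence-summarizing speaker network feeding a speaker-adaptive layer.
import torch
import torch.nn as nn

class SequenceSummary(nn.Module):
    def __init__(self, feat_dim=257, emb_dim=128):
        super().__init__()
        self.frame_net = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                       nn.Linear(emb_dim, emb_dim))

    def forward(self, adapt_utt):                    # (frames, feat_dim)
        return self.frame_net(adapt_utt).mean(dim=0) # average into one speaker vector

class SpeakerExtractor(nn.Module):
    def __init__(self, feat_dim=257, hidden=128):
        super().__init__()
        self.summary = SequenceSummary(feat_dim, hidden)
        self.pre = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.post = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, feat_dim), nn.Sigmoid())

    def forward(self, mixture, adapt_utt):           # (T, feat_dim), (Ta, feat_dim)
        spk = self.summary(adapt_utt)                # speaker vector, learned jointly
        h = self.pre(mixture) * spk                  # speaker-adaptive layer (elementwise)
        return self.post(h)                          # time-frequency mask for the target

net = SpeakerExtractor()
mask = net(torch.rand(200, 257), torch.rand(150, 257))
print(mask.shape)                                    # torch.Size([200, 257])
```

Because the summary network is trained jointly with the extraction network, the speaker vector is optimised for the extraction objective itself rather than for speaker recognition, which is the point made in the abstract.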