2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): Latest Publications

Computational cost reduction of long short-term memory based on simultaneous compression of input and hidden state
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268926
T. Masuko
Abstract: Long short-term memory (LSTM) has been successfully applied to acoustic modeling for automatic speech recognition (ASR). However, because of its complicated structure, LSTM incurs a high computational cost, especially when the memory cell has enough dimensions to achieve good ASR performance. In this paper, we present a novel technique for reducing the computational cost of LSTM in which the input and previous hidden state vectors are simultaneously compressed with a linear projection layer. Experimental results show that the proposed technique outperforms both a standard LSTM and an LSTM with a recurrent projection layer. They also show that, when model sizes are comparable, the ASR performance of the proposed technique improves as the number of memory cell dimensions increases.
Citations: 8
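The core idea, compressing the concatenation of the input x_t and the previous hidden state h_{t-1} through one shared low-dimensional projection before the four gates are computed, can be sketched as follows. This is a minimal NumPy illustration under assumed dimension names, not the author's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compressed_lstm_step(x_t, h_prev, c_prev, P, W, b):
    """One LSTM step with joint input/hidden compression (sketch).

    x_t    : (n_in,)    current input
    h_prev : (n_hid,)   previous hidden state
    c_prev : (n_cell,)  previous memory cell
    P      : (n_proj, n_in + n_hid)  shared compression matrix, n_proj small
    W      : (4 * n_cell, n_proj)    gate weights on the compressed vector
    b      : (4 * n_cell,)           gate biases
    """
    z = P @ np.concatenate([x_t, h_prev])  # simultaneous compression
    pre = W @ z + b
    n = c_prev.shape[0]
    i = sigmoid(pre[0 * n:1 * n])          # input gate
    f = sigmoid(pre[1 * n:2 * n])          # forget gate
    g = np.tanh(pre[2 * n:3 * n])          # cell candidate
    o = sigmoid(pre[3 * n:4 * n])          # output gate
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

With this factorization the per-step multiply count scales roughly as (n_in + n_hid) * n_proj + 4 * n_cell * n_proj instead of 4 * n_cell * (n_in + n_hid), which is where the saving comes from when n_proj is small.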
The USTC system for blizzard machine learning challenge 2017-ES2
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268998
Ya-Jun Hu, Li-Juan Liu, Chuang Ding, Zhenhua Ling, Lirong Dai
Abstract: The Blizzard Machine Learning Challenge (BMLC) aims to liberate participants from speech-specific processing when building speech synthesis systems. This paper describes the USTC system for the ES2 sub-task of BMLC2017, which requires participants to train a model that directly predicts waveforms from linguistic features. We investigate three aspects of waveform modeling when preparing our system for this task. First, two model structures for waveform modeling, WaveNet and SampleRNN, are compared on this task. Second, a strategy of using features extracted from waveforms as intermediate representations for waveform modeling is studied. Experimental results show that using low-level features (STFT amplitude spectra) as intermediate representations can achieve performance similar to that of high-level features (mel-cepstra and F0). Third, the feasibility of applying WaveNet to wideband speech signals with more than 256 quantization levels is verified experimentally. Finally, a system that adopts STFT amplitude spectra as intermediate representations to model 24 kHz speech waveforms with 1024 mu-law quantization levels is submitted for evaluation. The evaluation results of BMLC2017 demonstrate the effectiveness of our proposed methods.
Citations: 4
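The 1024-level quantization mentioned above is plain mu-law companding with mu = 1023, the usual generalization of the 256-level (mu = 255) scheme. As a reference point, here is the textbook encode/decode pair in NumPy, not the USTC code:

```python
import numpy as np

def mu_law_encode(x, levels=1024):
    """Compand a waveform in [-1, 1] and quantize to integer codes 0..levels-1."""
    mu = levels - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compress to [-1, 1]
    return np.round((y + 1) / 2 * mu).astype(np.int64)

def mu_law_decode(codes, levels=1024):
    """Invert mu_law_encode back to a waveform in [-1, 1]."""
    mu = levels - 1
    y = 2.0 * codes / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu
```

The logarithmic compression allocates more quantization levels to small amplitudes, which is why far fewer than 2^16 levels suffice for natural-sounding speech.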
Improving separation of overlapped speech for meeting conversations using uncalibrated microphone array
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268916
Keisuke Nakamura, R. Gomez
Abstract: In this paper, we propose a novel approach to sound source separation for meeting conversations, even with an uncalibrated microphone array. Our method blindly estimates three sets of parameters for separation: steering vectors (SVs), speaker indices, and the activity periods of each speaker. First, we estimate the number of speakers and the SVs by clustering time delay of arrival (TDOA) estimates of the observed signal and selecting the major clusters to compute TDOA-based SVs. Then, speaker indices and activity periods are estimated by thresholding the spatial spectrum computed with the estimated SVs, where the threshold itself is obtained blindly. Finally, we separate overlapped speech and noise by dynamically designing the noise correlation matrices of a minimum variance distortionless response (MVDR) beamformer from the blindly estimated parameters. The proposed algorithm was evaluated on both an objective separation measure and recognition accuracy, and showed improvements in both single and simultaneous speech scenarios in a reverberant meeting room. Moreover, the blindly estimated parameters improved separation and recognition compared to geometrically obtained parameters.
Citations: 2
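For one frequency bin, the MVDR beamformer has the standard closed form w = R_n^{-1} d / (d^H R_n^{-1} d), where R_n is the noise correlation matrix and d the steering vector. A minimal NumPy sketch of that formula; the paper's dynamic design of R_n and its blind TDOA-based SV estimation are not reproduced here:

```python
import numpy as np

def mvdr_weights(noise_cov, steering_vec):
    """Standard MVDR solution w = R_n^{-1} d / (d^H R_n^{-1} d) for one bin.

    noise_cov    : (M, M) complex noise correlation matrix R_n
    steering_vec : (M,)   complex steering vector d toward the target speaker
    """
    rinv_d = np.linalg.solve(noise_cov, steering_vec)
    return rinv_d / (steering_vec.conj() @ rinv_d)

def beamform_bin(weights, frames):
    """Apply the weights to (M, T) complex STFT frames of one frequency bin."""
    return weights.conj() @ frames
```

The denominator enforces the distortionless constraint w^H d = 1, so the target direction passes unmodified while noise power is minimized.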
Multilingual bottle-neck feature learning from untranscribed speech
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8269009
Hongjie Chen, C. Leung, Lei Xie, B. Ma, Haizhou Li
Abstract: We propose to learn a low-dimensional feature representation for multiple languages without access to their manual transcription. The multilingual features are extracted from a shared bottleneck layer of a multi-task learning deep neural network that is trained using unsupervised phoneme-like labels. The unsupervised phoneme-like labels are obtained from language-dependent Dirichlet process Gaussian mixture models (DPGMMs). Vocal tract length normalization (VTLN) is applied to mel-frequency cepstral coefficients to reduce talker variation when the DPGMMs are trained. The proposed features are evaluated using the ABX phoneme discriminability test of the Zero Resource Speech Challenge 2017. In the experiments, we show that the proposed features perform well across different languages, and they consistently outperform our previously proposed DPGMM posteriorgrams, which achieved the top performance in the same challenge in 2015.
Citations: 32
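The ABX test used for evaluation asks, for a token X of the same phoneme category as A and a token B of a different category, whether X is closer to A than to B, typically under a DTW distance over frame-level features. A simplified sketch with cosine frame distance; the challenge's official scoring aggregates over many such triples and contexts:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW alignment cost between frame sequences a: (Ta, d) and b: (Tb, d),
    using cosine distance between frames, normalized by Ta + Tb."""
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    local = 1.0 - an @ bn.T                    # pairwise cosine distances
    Ta, Tb = local.shape
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[Ta, Tb] / (Ta + Tb)

def abx_trial_correct(a, b, x):
    """One ABX trial: X shares A's category; correct if X is nearer to A."""
    return dtw_distance(a, x) < dtw_distance(b, x)
```

The ABX error rate is then the fraction of trials answered incorrectly, so lower is better.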
Noise-robust exemplar matching for rescoring query-by-example search
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268909
Emre Yilmaz, Julien van Hout, H. Franco
Abstract: This paper describes a two-step approach to the keyword spotting task in which a query-by-example (QbE) search is followed by noise-robust exemplar matching (N-REM) rescoring. In the first stage, subsequence dynamic time warping is performed to detect keywords in search utterances. In the second stage, these target frame sequences are rescored using the reconstruction errors provided by the linear combination of the available exemplars extracted from the training data. Due to data sparsity, we align the target frame sequence and the exemplars to a common frame length, and the exemplar weights are obtained by solving a convex optimization problem with nonnegative sparse coding. We run keyword spotting experiments on the Air Traffic Control (ATC) database and evaluate the performance of multiple distance metrics for calculating the weights and reconstruction errors using convolutional neural network (CNN) bottleneck features. The results demonstrate that the proposed two-step keyword spotting approach provides better keyword detection compared to a baseline with only QbE search.
Citations: 1
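The rescoring stage reduces to a convex nonnegative reconstruction problem: approximate each length-aligned detection as a nonnegative combination of exemplars and use the reconstruction error as the new score. A sketch using plain nonnegative least squares via SciPy; the sparsity-inducing regularizer and the alternative distance metrics studied in the paper are omitted here:

```python
import numpy as np
from scipy.optimize import nnls

def exemplar_rescore(target_seq, exemplar_mat):
    """Rescore one detection by nonnegative reconstruction from exemplars.

    target_seq   : (d,)   detected frame sequence, time-aligned to a common
                          length and flattened into one vector
    exemplar_mat : (d, n) n exemplars, aligned and flattened the same way
    Returns (weights, reconstruction_error); a lower error means the
    detection looks more like the keyword's training exemplars.
    """
    weights, residual = nnls(exemplar_mat, target_seq)
    return weights, residual
```

A detection whose residual stays high even with the best nonnegative weights is likely a false alarm from the first-stage QbE search.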
Sequence training of DNN acoustic models with natural gradient
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268933
Adnan Haider, P. Woodland
Abstract: Deep Neural Network (DNN) acoustic models often use discriminative sequence training, which optimises an objective function that better approximates the word error rate (WER) than frame-based training. Sequence training is normally implemented using Stochastic Gradient Descent (SGD) or Hessian Free (HF) training. This paper proposes an alternative batch-style optimisation framework that employs a Natural Gradient (NG) approach to traverse the parameter space. By correcting the gradient according to the local curvature of the KL-divergence, the NG optimisation process converges more quickly than HF. Furthermore, the proposed NG approach can be applied to any sequence discriminative training criterion. The efficacy of the NG method is shown using experiments on a Multi-Genre Broadcast (MGB) transcription task that demonstrate both the computational efficiency and the accuracy of the resulting DNN models.
Citations: 6
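Generically, natural gradient preconditions the ordinary gradient with the inverse Fisher information matrix F, which is the local curvature of the KL-divergence between the model before and after the update: theta <- theta - eta * F^{-1} * grad. A toy dense-matrix sketch of that update; practical DNN implementations, including the paper's batch-style scheme, approximate F rather than solving with it exactly:

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-4):
    """One natural-gradient update:
    theta <- theta - lr * (F + damping * I)^{-1} grad.

    Damping keeps the estimated Fisher matrix well conditioned; real
    systems use factored or low-rank approximations instead of a dense F.
    """
    f_damped = fisher + damping * np.eye(theta.shape[0])
    return theta - lr * np.linalg.solve(f_damped, grad)
```

Because the update is measured in KL-divergence rather than raw parameter distance, it is invariant to smooth reparameterizations of the model, which is the intuition behind its faster convergence.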
Meeting recognition with asynchronous distributed microphone array
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268913
S. Araki, Nobutaka Ono, K. Kinoshita, Marc Delcroix
Abstract: Recently, recognition of conversational speech such as meetings has been widely studied. However, most existing approaches rely on a single close-talking microphone or a distant microphone array in which all the microphones are synchronous. In contrast, this paper tackles the recognition of conversational speech recorded with asynchronous distributed microphones, to which conventional array processing is not directly applicable. We demonstrate that we can significantly improve recognition performance even when the microphones are asynchronous by combining blind synchronization with state-of-the-art microphone array speech enhancement techniques such as independent vector analysis (IVA) and a time-frequency mask based minimum variance distortionless response (MVDR) beamformer. Using such a front-end, we reduced the word error rate from 42.2% to 29.9% on real meeting recordings.
Citations: 16
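One building block of blind synchronization is estimating the relative sample offset between two unsynchronized recordings, for instance from the peak of their cross-correlation. A minimal sketch; real systems, including the one described here, must also handle sampling-frequency mismatch and clock drift, which this ignores:

```python
import numpy as np

def estimate_offset(ref, sig):
    """Return the lag (in samples) at which `sig` best aligns with `ref`,
    taken from the peak of the full cross-correlation."""
    corr = np.correlate(sig, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

# Toy usage: sig is ref delayed by 480 samples, so the estimate is 480.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
sig = np.concatenate([np.zeros(480), ref])
print(estimate_offset(ref, sig))  # -> 480
```

Once offsets are compensated, the recordings can be treated as a (roughly) synchronous array, at which point IVA and MVDR enhancement become applicable.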
JHU Kaldi system for Arabic MGB-3 ASR challenge using diarization, audio-transcript alignment and transfer learning
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268956
Vimal Manohar, Daniel Povey, S. Khudanpur
Abstract: This paper describes the JHU team's Kaldi system submission to Arabic MGB-3, the Arabic speech-recognition-in-the-wild challenge for ASRU-2017. We use a weights-transfer approach to adapt a neural network trained on the out-of-domain MGB-2 multi-dialect Arabic TV broadcast corpus to the MGB-3 Egyptian YouTube video corpus. The neural network has a TDNN-LSTM architecture and is trained using the lattice-free maximum mutual information (LF-MMI) objective followed by sMBR discriminative training. For supervision, we fuse transcripts from four independent transcribers into confusion network training graphs. We also describe our own approach to speaker diarization and audio-transcript alignment, which we use to prepare lightly supervised transcriptions for training the seed system that is adapted to MGB-3. Our primary submission to the challenge gives a multi-reference WER of 32.78% on the MGB-3 test set.
Citations: 43
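The weights-transfer pattern itself is straightforward: copy the hidden layers trained on the out-of-domain corpus, attach a fresh output layer for the new targets, and fine-tune on the in-domain data, usually with a reduced learning rate on the transferred parameters. A hypothetical PyTorch sketch of the pattern; the layer sizes, learning rates, and nn.Sequential stand-in models are illustrative assumptions, not the Kaldi configuration used in the paper:

```python
import torch.nn as nn
import torch.optim as optim

# Stand-ins for the source (out-of-domain) and target (in-domain) models.
source = nn.Sequential(nn.Linear(40, 512), nn.ReLU(), nn.Linear(512, 3000))
target = nn.Sequential(nn.Linear(40, 512), nn.ReLU(), nn.Linear(512, 2500))

# Transfer: copy the shared hidden layer; the new output layer stays random.
target[0].load_state_dict(source[0].state_dict())

# Fine-tune with a smaller learning rate on the transferred parameters.
optimizer = optim.SGD(
    [
        {"params": target[0].parameters(), "lr": 1e-4},  # transferred layer
        {"params": target[2].parameters()},              # new output layer
    ],
    lr=1e-3,  # default rate, applied to the new output layer
)
```

The smaller rate on the transferred layer preserves the out-of-domain knowledge while the new output layer adapts quickly to the in-domain targets.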
Unsupervised HMM posteriograms for language independent acoustic modeling in zero resource conditions
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8269014
T. Ansari, Rajath Kumar, Sonali Singh, Sriram Ganapathy, V. Devi
Abstract: The task of language-independent acoustic unit modeling in unlabeled raw speech (the zero-resource setting) has gained significant interest in recent years. The main challenge is to extract acoustic representations that elicit good similarity between the same words or linguistic tokens spoken by different speakers, and to derive these representations in a language-independent manner. In this paper, we explore the use of Hidden Markov Model (HMM) based posteriorgrams for unsupervised acoustic unit modeling. The states of the HMM (which represent the language-independent acoustic units) are initialized using a Gaussian mixture model-universal background model (GMM-UBM). The trained HMM is subsequently used to generate temporally contiguous state alignments, which are then modeled in a hybrid deep neural network (DNN). For testing, we use the frame-level HMM state posteriors obtained from the DNN as features for the ZeroSpeech challenge task. The minimal-pair ABX error rate is measured for both within-speaker and across-speaker pairs. With several experiments on multiple languages in the ZeroSpeech corpus, we show that the proposed HMM-based posterior features provide significant improvements over the baseline system using MFCC features (average relative improvements of 25% for within-speaker pairs and 40% for across-speaker pairs). Furthermore, experiments in which the target language is not seen in training illustrate that the proposed modeling approach is capable of learning globally language-independent representations.
Citations: 14
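The first stage, training an HMM on unlabeled frames and reading off per-frame state posteriors as the feature representation, can be illustrated with hmmlearn as a stand-in; the paper's GMM-UBM initialization and subsequent hybrid DNN stage are omitted, and the features below are random placeholders:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Placeholder features standing in for unlabeled speech (n_frames, n_dims).
feats = np.random.default_rng(0).standard_normal((5000, 13))

# Unsupervised EM training of an ergodic HMM whose states play the role
# of language-independent acoustic units.
hmm = GaussianHMM(n_components=50, covariance_type="diag", n_iter=20)
hmm.fit(feats)

# Frame-level state posteriors: the "posteriorgram" representation.
posteriorgram = hmm.predict_proba(feats)  # shape (5000, 50)
```

Compared with frame-independent clustering, the HMM transition model encourages temporally contiguous unit labels, which is the property the paper exploits.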
Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to zerospeech 2017
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8269011
Michael Heck, S. Sakti, Satoshi Nakamura
Abstract: This paper describes our unsupervised subword modeling pipeline for the zero resource speech challenge (ZeroSpeech) 2017. Our approach is built around the Dirichlet process Gaussian mixture model (DPGMM), which we use to cluster speech feature vectors into a dynamically sized set of classes. By considering each class an acoustic unit, speech can be represented as a sequence of class posteriorgrams. We enhance this method by automatically optimizing the DPGMM sampler's input features in a multi-stage clustering framework, in which we learn LDA, MLLT and (basis) fMLLR transformations without supervision to reduce variance in the features. We show that this optimization considerably boosts subword modeling quality, according to performance on the ABX phone discriminability task. For the first time, we apply the inferred subword models to previously unseen data from a new set of speakers. We demonstrate our method's good generalization and the effectiveness of its blind speaker adaptation in extensive experiments on a multitude of datasets. Our pipeline has very little need for hyper-parameter adjustment and is entirely unsupervised, i.e., it only takes raw audio recordings as input, without requiring any pre-defined segmentation, explicit speaker IDs or other metadata.
Citations: 49
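As a rough stand-in for the paper's Gibbs-sampled DPGMM, scikit-learn's variational Bayesian mixture with a Dirichlet-process prior shows the overall shape of the pipeline: fit a mixture whose effective number of components adapts to the data, then use per-frame class posteriors as the subword representation. The LDA/MLLT/fMLLR feature optimization is not reproduced, and the features below are random placeholders:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Placeholder features standing in for speech frames (n_frames, n_dims).
feats = np.random.default_rng(0).standard_normal((2000, 39))

# Variational DP mixture: n_components is only a truncation level; unused
# components receive negligible weight, so the effective class count adapts.
dpgmm = BayesianGaussianMixture(
    n_components=100,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=200,
)
dpgmm.fit(feats)

# Per-frame class posteriorgrams used as the subword representation.
posteriorgrams = dpgmm.predict_proba(feats)  # shape (2000, 100)
```

This adaptivity of the class count is what lets the pipeline run without hand-tuning the number of acoustic units per language.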