{"title":"Deep quaternion neural networks for spoken language understanding","authors":"Titouan Parcollet, Mohamed Morchid, G. Linarès","doi":"10.1109/ASRU.2017.8268978","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268978","url":null,"abstract":"Deep Neural Networks (DNN) received a great interest from researchers due to their capability to construct robust abstract representations of heterogeneous documents in a latent subspace. Nonetheless, mere real-valued deep neural networks require an appropriate adaptation, such as the convolution process, to capture latent relations between input features. Moreover, real-valued deep neural networks reveal little in way of document internal dependencies, by only considering words or topics contained in the document as an isolate basic element. Quaternion-valued multi-layer per-ceptrons (QMLP), and autoencoders (QAE) have been introduced to capture such latent dependencies, alongside to represent multidimensional data. Nonetheless, a three-layered neural network does not benefit from the high abstraction capability of DNNs. The paper proposes first to extend the hyper-complex algebra to deep neural networks (QDNN) and, then, introduces pre-trained deep quaternion neural networks (QDNN-AE) with dedicated quaternion encoder-decoders (QAE). The experiments conduced on a theme identification task of spoken dialogues from the DECODA data set show, inter alia, that the QDNN-AE reaches a promising gain of 2.2% compared to the standard real-valued DNN-AE.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133683177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hierarchical attention based model for off-topic spontaneous spoken response detection","authors":"A. Malinin, K. Knill, M. Gales","doi":"10.1109/ASRU.2017.8268963","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268963","url":null,"abstract":"Automatic spoken language assessment and training systems are becoming increasingly popular to handle the growing demand to learn languages. However, current systems often assess only fluency and pronunciation, with limited content-based features being used. This paper examines one particular aspect of content-assessment, off-topic response detection. This is important for deployed systems as it ensures that candidates understood the prompt, and are able to generate an appropriate answer. Previously proposed approaches typically require a set of prompt-response training pairs, which limits flexibility as example responses are required whenever a new test prompt is introduced. Recently, the attention based neural topic model (ATM) was presented, which can assess the relevance of prompt-response pairs regardless of whether the prompt was seen in training. This model uses a bidirectional Recurrent Neural Network (BiRNN) embedding of the prompt combined with an attention mechanism to attend over the hidden states of a BiRNN embedding of the response to compute a fixed-length embedding used to predict relevance. Unfortunately, performance on prompts not seen in the training data is lower than on seen prompts. Thus, this paper adds the following contributions: several improvements to the ATM are examined; a hierarchical variant of the ATM (HATM) is proposed, which explicitly uses prompt similarity to further improve performance on unseen prompts by interpolating over prompts seen in training data given a prompt of interest via a second attention mechanism; an in-depth analysis of both models is conducted and main failure mode identified. On spontaneous spoken data, taken from BULATS tests, these systems are able to assess relevance to both seen and unseen prompts.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115507351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On lattice generation for large vocabulary speech recognition","authors":"David Rybach, M. Riley, J. Schalkwyk","doi":"10.1109/ASRU.2017.8268940","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268940","url":null,"abstract":"Lattice generation is an essential feature of the decoder for many speech recognition applications. In this paper, we first review lattice generation methods for WFST-based decoding and describe in a uniform formalism two established approaches for state-of-the-art speech recognition systems: the phone pair and the N-best histories approaches. We then present a novel optimization method, pruned determinization followed by minimization, that produces a deterministic minimal lattice that retains all paths within specified weight and lattice size thresholds. Experimentally, we show that before optimization, the phone-pair and the N-best histories approaches each have conditions where they perform better when evaluated on video transcription and mixed voice search and dictation tasks. However, once this lattice optimization procedure is applied, the phone pair approach has the lowest oracle WER for a given lattice density by a significant margin. We further show that the pruned determinization presented here is efficient to use during decoding unlike classical weighted determinization from which it is derived. Finally, we consider on-the-fly lattice rescoring in which the lattice generation and combination with the secondary LM are done in one step. We compare the phone pair and N-best histories approaches for this scenario and find the former superior in our experiments.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116273105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse representation of phonetic features for voice conversion with and without parallel data","authors":"Berrak Sisman, Haizhou Li, K. Tan","doi":"10.1109/ASRU.2017.8269002","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269002","url":null,"abstract":"This paper presents a voice conversion framework that uses phonetic information in an exemplar-based voice conversion approach. The proposed idea is motivated by the fact that phone-dependent exemplars lead to better estimation of activation matrix, therefore, possibly better conversion. We propose to use the phone segmentation results from automatic speech recognition (ASR) to construct a sub-dictionary for each phone. The proposed framework can work with or without parallel training data. With parallel training data, we found that phonetic sub-dictionary outperforms the state-of-the-art baseline in objective and subjective evaluations. Without parallel training data, we use Phonetic PosteriorGrams (PPGs) as the speaker-independent exemplars in the phonetic sub-dictionary to serve as a bridge between speakers. We report that such technique achieves a competitive performance without the need of parallel training data.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"13 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128144414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-view (Joint) probability linear discrimination analysis for J-vector based text dependent speaker verification","authors":"Ziqiang Shi, L. Liu, Mengjiao Wang, Rujie Liu","doi":"10.1109/ASRU.2017.8268993","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268993","url":null,"abstract":"J-vector has been proved to be very effective in text dependent speaker verification with short-duration speech. However, the current back-end classifiers cannot make full use of such deep features. In this paper, we propose a method to model the multi-faceted information in the j-vector explicitly and jointly. Examples of the multi-faceted information include speaker identity and text content. In our approach, the j-vector was modeled as a result derived by a generative multi-view (joint1) Probability Linear Discriminant Analysis (PLDA) model, which contains multiple kinds of latent variables. The usual PLDA model only considers one single label. However, in practical use, when using multi-task learned network as feature extractor, the extracted feature are always associated with several labels. This type of feature is called multi-view deep feature (e.g. j-vector). With multi-view (joint) PLDA, we are able to explicitly build a model that can combine multiple heterogeneous information from the j-vectors. In verification step, we calculated the likelihood to describe whether the two j-vectors having consistent labels or not. This likelihood is used in the following decision-making. Experiments have been conducted on large scale data corpus of different languages. On the public RSR2015 data corpus, the results showed that our approach can achieve 0.02% EER and 0.09% EER for impostor wrong and impostor correct cases respectively.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130567454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigation of lattice-free maximum mutual information-based acoustic models with sequence-level Kullback-Leibler divergence","authors":"Naoyuki Kanda, Yusuke Fujita, Kenji Nagamatsu","doi":"10.1109/ASRU.2017.8268918","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268918","url":null,"abstract":"Lattice-free maximum mutual information (LFMMI) was recently proposed as a mixture of the ideas of hidden-Markov-model-based acoustic models (AMs) and connectionist-temporal-classification-based AMs. In this paper, we investigate LFMMI from various perspectives of model combination, teacher-student training, and unsupervised speaker adaptation. Especially, we thoroughly investigate the use of the “sequence-level” Kullback-Leibler divergence with its novel and simple error derivation to enhance LFMMI-based AMs. In our experiment, we used the corpus of spontaneous Japanese (CSJ). Our best AM was an ensemble of three types of time delay neural networks and one long short-term memory-based network, and it finally achieved a WER of 6.94%, which is, to the best of our knowledge, the best published result for the CSJ.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116811310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker-sensitive dual memory networks for multi-turn slot tagging","authors":"Young-Bum Kim, Sungjin Lee, R. Sarikaya","doi":"10.1109/ASRU.2017.8268983","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268983","url":null,"abstract":"In multi-turn dialogs, natural language understanding models can introduce obvious errors by being blind to contextual information. To incorporate dialog history, we present a neural architecture with Speaker-Sensitive Dual Memory Networks which encode utterances differently depending on the speaker. This addresses the different extents of information available to the system — the system knows only the surface form of user utterances while it has the exact semantics of system output. We performed experiments on real user data from Microsoft Cortana, a commercial personal assistant. The result showed a significant performance improvement over the state-of-the-art slot tagging models using contextual information.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116689531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acoustic-to-word model without OOV","authors":"Jinyu Li, Guoli Ye, Rui Zhao, J. Droppo, Y. Gong","doi":"10.1109/ASRU.2017.8268924","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268924","url":null,"abstract":"Recently, the acoustic-to-word model based on the Connectionist Temporal Classification (CTC) criterion was shown as a natural end-to-end model directly targeting words as output units. However, this type of word-based CTC model suffers from the out-of-vocabulary (OOV) issue as it can only model limited number of words in the output layer and maps all the remaining words into an OOV output node. Therefore, such word-based CTC model can only recognize the frequent words modeled by the network output nodes. It also cannot easily handle the hot-words which emerge after the model is trained. In this study, we improve the acoustic-to-word model with a hybrid CTC model which can predict both words and characters at the same time. With a shared-hidden-layer structure and modular design, the alignments of words generated from the word-based CTC and the character-based CTC are synchronized. Whenever the acoustic-to-word model emits an OOV token, we back off that OOV segment to the word output generated from the character-based CTC, hence solving the OOV or hot-words issue. Evaluated on a Microsoft Cortana voice assistant task, the proposed model can reduce the errors introduced by the OOV output token in the acoustic-to-word model by 30%.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"38 35","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131500505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised adaptation with domain separation networks for robust speech recognition","authors":"Zhong Meng, Zhuo Chen, V. Mazalov, Jinyu Li, Y. Gong","doi":"10.1109/ASRU.2017.8268938","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268938","url":null,"abstract":"Unsupervised domain adaptation of speech signal aims at adapting a well-trained source-domain acoustic model to the unlabeled data from target domain. This can be achieved by adversarial training of deep neural network (DNN) acoustic models to learn an intermediate deep representation that is both senone-discriminative and domain-invariant. Specifically, the DNN is trained to jointly optimize the primary task of senone classification and the secondary task of domain classification with adversarial objective functions. In this work, instead of only focusing on learning a domain-invariant feature (i.e. the shared component between domains), we also characterize the difference between the source and target domain distributions by explicitly modeling the private component of each domain through a private component extractor DNN. The private component is trained to be orthogonal with the shared component and thus implicitly increases the degree of domain-invariance of the shared component. A reconstructor DNN is used to reconstruct the original speech feature from the private and shared components as a regularization. This domain separation framework is applied to the unsupervised environment adaptation task and achieved 11.08% relative WER reduction from the gradient reversal layer training, a representative adversarial training method, for automatic speech recognition on CHiME-3 dataset.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128969324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lattice rescoring strategies for long short term memory language models in speech recognition","authors":"Shankar Kumar, M. Nirschl, D. Holtmann-Rice, H. Liao, A. Suresh, Felix X. Yu","doi":"10.1109/ASRU.2017.8268931","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268931","url":null,"abstract":"Recurrent neural network (RNN) language models (LMs) and Long Short Term Memory (LSTM) LMs, a variant of RNN LMs, have been shown to outperform traditional N-gram LMs on speech recognition tasks. However, these models are computationally more expensive than N-gram LMs for decoding, and thus, challenging to integrate into speech recognizers. Recent research has proposed the use of lattice-rescoring algorithms using RNNLMs and LSTMLMs as an efficient strategy to integrate these models into a speech recognition system. In this paper, we evaluate existing lattice rescoring algorithms along with new variants on a YouTube speech recognition task. Lattice rescoring using LSTMLMs reduces the word error rate (WER) for this task by 8% relative to the WER obtained using an N-gram LM.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128015688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}