{"title":"Deep quaternion neural networks for spoken language understanding","authors":"Titouan Parcollet, Mohamed Morchid, G. Linarès","doi":"10.1109/ASRU.2017.8268978","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268978","url":null,"abstract":"Deep Neural Networks (DNN) received a great interest from researchers due to their capability to construct robust abstract representations of heterogeneous documents in a latent subspace. Nonetheless, mere real-valued deep neural networks require an appropriate adaptation, such as the convolution process, to capture latent relations between input features. Moreover, real-valued deep neural networks reveal little in way of document internal dependencies, by only considering words or topics contained in the document as an isolate basic element. Quaternion-valued multi-layer per-ceptrons (QMLP), and autoencoders (QAE) have been introduced to capture such latent dependencies, alongside to represent multidimensional data. Nonetheless, a three-layered neural network does not benefit from the high abstraction capability of DNNs. The paper proposes first to extend the hyper-complex algebra to deep neural networks (QDNN) and, then, introduces pre-trained deep quaternion neural networks (QDNN-AE) with dedicated quaternion encoder-decoders (QAE). The experiments conduced on a theme identification task of spoken dialogues from the DECODA data set show, inter alia, that the QDNN-AE reaches a promising gain of 2.2% compared to the standard real-valued DNN-AE.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133683177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hierarchical attention based model for off-topic spontaneous spoken response detection","authors":"A. Malinin, K. Knill, M. Gales","doi":"10.1109/ASRU.2017.8268963","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268963","url":null,"abstract":"Automatic spoken language assessment and training systems are becoming increasingly popular to handle the growing demand to learn languages. However, current systems often assess only fluency and pronunciation, with limited content-based features being used. This paper examines one particular aspect of content-assessment, off-topic response detection. This is important for deployed systems as it ensures that candidates understood the prompt, and are able to generate an appropriate answer. Previously proposed approaches typically require a set of prompt-response training pairs, which limits flexibility as example responses are required whenever a new test prompt is introduced. Recently, the attention based neural topic model (ATM) was presented, which can assess the relevance of prompt-response pairs regardless of whether the prompt was seen in training. This model uses a bidirectional Recurrent Neural Network (BiRNN) embedding of the prompt combined with an attention mechanism to attend over the hidden states of a BiRNN embedding of the response to compute a fixed-length embedding used to predict relevance. Unfortunately, performance on prompts not seen in the training data is lower than on seen prompts. Thus, this paper adds the following contributions: several improvements to the ATM are examined; a hierarchical variant of the ATM (HATM) is proposed, which explicitly uses prompt similarity to further improve performance on unseen prompts by interpolating over prompts seen in training data given a prompt of interest via a second attention mechanism; an in-depth analysis of both models is conducted and main failure mode identified. On spontaneous spoken data, taken from BULATS tests, these systems are able to assess relevance to both seen and unseen prompts.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115507351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On lattice generation for large vocabulary speech recognition","authors":"David Rybach, M. Riley, J. Schalkwyk","doi":"10.1109/ASRU.2017.8268940","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268940","url":null,"abstract":"Lattice generation is an essential feature of the decoder for many speech recognition applications. In this paper, we first review lattice generation methods for WFST-based decoding and describe in a uniform formalism two established approaches for state-of-the-art speech recognition systems: the phone pair and the N-best histories approaches. We then present a novel optimization method, pruned determinization followed by minimization, that produces a deterministic minimal lattice that retains all paths within specified weight and lattice size thresholds. Experimentally, we show that before optimization, the phone-pair and the N-best histories approaches each have conditions where they perform better when evaluated on video transcription and mixed voice search and dictation tasks. However, once this lattice optimization procedure is applied, the phone pair approach has the lowest oracle WER for a given lattice density by a significant margin. We further show that the pruned determinization presented here is efficient to use during decoding unlike classical weighted determinization from which it is derived. Finally, we consider on-the-fly lattice rescoring in which the lattice generation and combination with the secondary LM are done in one step. We compare the phone pair and N-best histories approaches for this scenario and find the former superior in our experiments.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116273105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse representation of phonetic features for voice conversion with and without parallel data","authors":"Berrak Sisman, Haizhou Li, K. Tan","doi":"10.1109/ASRU.2017.8269002","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269002","url":null,"abstract":"This paper presents a voice conversion framework that uses phonetic information in an exemplar-based voice conversion approach. The proposed idea is motivated by the fact that phone-dependent exemplars lead to better estimation of activation matrix, therefore, possibly better conversion. We propose to use the phone segmentation results from automatic speech recognition (ASR) to construct a sub-dictionary for each phone. The proposed framework can work with or without parallel training data. With parallel training data, we found that phonetic sub-dictionary outperforms the state-of-the-art baseline in objective and subjective evaluations. Without parallel training data, we use Phonetic PosteriorGrams (PPGs) as the speaker-independent exemplars in the phonetic sub-dictionary to serve as a bridge between speakers. We report that such technique achieves a competitive performance without the need of parallel training data.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"13 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128144414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-view (Joint) probability linear discrimination analysis for J-vector based text dependent speaker verification","authors":"Ziqiang Shi, L. Liu, Mengjiao Wang, Rujie Liu","doi":"10.1109/ASRU.2017.8268993","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268993","url":null,"abstract":"J-vector has been proved to be very effective in text dependent speaker verification with short-duration speech. However, the current back-end classifiers cannot make full use of such deep features. In this paper, we propose a method to model the multi-faceted information in the j-vector explicitly and jointly. Examples of the multi-faceted information include speaker identity and text content. In our approach, the j-vector was modeled as a result derived by a generative multi-view (joint1) Probability Linear Discriminant Analysis (PLDA) model, which contains multiple kinds of latent variables. The usual PLDA model only considers one single label. However, in practical use, when using multi-task learned network as feature extractor, the extracted feature are always associated with several labels. This type of feature is called multi-view deep feature (e.g. j-vector). With multi-view (joint) PLDA, we are able to explicitly build a model that can combine multiple heterogeneous information from the j-vectors. In verification step, we calculated the likelihood to describe whether the two j-vectors having consistent labels or not. This likelihood is used in the following decision-making. Experiments have been conducted on large scale data corpus of different languages. On the public RSR2015 data corpus, the results showed that our approach can achieve 0.02% EER and 0.09% EER for impostor wrong and impostor correct cases respectively.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130567454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigation of lattice-free maximum mutual information-based acoustic models with sequence-level Kullback-Leibler divergence","authors":"Naoyuki Kanda, Yusuke Fujita, Kenji Nagamatsu","doi":"10.1109/ASRU.2017.8268918","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268918","url":null,"abstract":"Lattice-free maximum mutual information (LFMMI) was recently proposed as a mixture of the ideas of hidden-Markov-model-based acoustic models (AMs) and connectionist-temporal-classification-based AMs. In this paper, we investigate LFMMI from various perspectives of model combination, teacher-student training, and unsupervised speaker adaptation. Especially, we thoroughly investigate the use of the “sequence-level” Kullback-Leibler divergence with its novel and simple error derivation to enhance LFMMI-based AMs. In our experiment, we used the corpus of spontaneous Japanese (CSJ). Our best AM was an ensemble of three types of time delay neural networks and one long short-term memory-based network, and it finally achieved a WER of 6.94%, which is, to the best of our knowledge, the best published result for the CSJ.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116811310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker-sensitive dual memory networks for multi-turn slot tagging","authors":"Young-Bum Kim, Sungjin Lee, R. Sarikaya","doi":"10.1109/ASRU.2017.8268983","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268983","url":null,"abstract":"In multi-turn dialogs, natural language understanding models can introduce obvious errors by being blind to contextual information. To incorporate dialog history, we present a neural architecture with Speaker-Sensitive Dual Memory Networks which encode utterances differently depending on the speaker. This addresses the different extents of information available to the system — the system knows only the surface form of user utterances while it has the exact semantics of system output. We performed experiments on real user data from Microsoft Cortana, a commercial personal assistant. The result showed a significant performance improvement over the state-of-the-art slot tagging models using contextual information.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116689531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acoustic-to-word model without OOV","authors":"Jinyu Li, Guoli Ye, Rui Zhao, J. Droppo, Y. Gong","doi":"10.1109/ASRU.2017.8268924","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268924","url":null,"abstract":"Recently, the acoustic-to-word model based on the Connectionist Temporal Classification (CTC) criterion was shown as a natural end-to-end model directly targeting words as output units. However, this type of word-based CTC model suffers from the out-of-vocabulary (OOV) issue as it can only model limited number of words in the output layer and maps all the remaining words into an OOV output node. Therefore, such word-based CTC model can only recognize the frequent words modeled by the network output nodes. It also cannot easily handle the hot-words which emerge after the model is trained. In this study, we improve the acoustic-to-word model with a hybrid CTC model which can predict both words and characters at the same time. With a shared-hidden-layer structure and modular design, the alignments of words generated from the word-based CTC and the character-based CTC are synchronized. Whenever the acoustic-to-word model emits an OOV token, we back off that OOV segment to the word output generated from the character-based CTC, hence solving the OOV or hot-words issue. Evaluated on a Microsoft Cortana voice assistant task, the proposed model can reduce the errors introduced by the OOV output token in the acoustic-to-word model by 30%.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"38 35","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131500505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised adaptation with domain separation networks for robust speech recognition","authors":"Zhong Meng, Zhuo Chen, V. Mazalov, Jinyu Li, Y. Gong","doi":"10.1109/ASRU.2017.8268938","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268938","url":null,"abstract":"Unsupervised domain adaptation of speech signal aims at adapting a well-trained source-domain acoustic model to the unlabeled data from target domain. This can be achieved by adversarial training of deep neural network (DNN) acoustic models to learn an intermediate deep representation that is both senone-discriminative and domain-invariant. Specifically, the DNN is trained to jointly optimize the primary task of senone classification and the secondary task of domain classification with adversarial objective functions. In this work, instead of only focusing on learning a domain-invariant feature (i.e. the shared component between domains), we also characterize the difference between the source and target domain distributions by explicitly modeling the private component of each domain through a private component extractor DNN. The private component is trained to be orthogonal with the shared component and thus implicitly increases the degree of domain-invariance of the shared component. A reconstructor DNN is used to reconstruct the original speech feature from the private and shared components as a regularization. This domain separation framework is applied to the unsupervised environment adaptation task and achieved 11.08% relative WER reduction from the gradient reversal layer training, a representative adversarial training method, for automatic speech recognition on CHiME-3 dataset.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128969324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lattice rescoring strategies for long short term memory language models in speech recognition","authors":"Shankar Kumar, M. Nirschl, D. Holtmann-Rice, H. Liao, A. Suresh, Felix X. Yu","doi":"10.1109/ASRU.2017.8268931","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268931","url":null,"abstract":"Recurrent neural network (RNN) language models (LMs) and Long Short Term Memory (LSTM) LMs, a variant of RNN LMs, have been shown to outperform traditional N-gram LMs on speech recognition tasks. However, these models are computationally more expensive than N-gram LMs for decoding, and thus, challenging to integrate into speech recognizers. Recent research has proposed the use of lattice-rescoring algorithms using RNNLMs and LSTMLMs as an efficient strategy to integrate these models into a speech recognition system. In this paper, we evaluate existing lattice rescoring algorithms along with new variants on a YouTube speech recognition task. Lattice rescoring using LSTMLMs reduces the word error rate (WER) for this task by 8% relative to the WER obtained using an N-gram LM.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128015688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}