Title: Robust Belief State Space Representation for Statistical Dialogue Managers Using Deep Autoencoders
Authors: Fotios Lygerakis, Vassilios Diakoloulas, M. Lagoudakis, M. Kotti
DOI: 10.1109/ASRU46091.2019.9003871 (https://doi.org/10.1109/ASRU46091.2019.9003871)
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Abstract: Statistical Dialogue Systems (SDS) have proved their enormous potential over the past few years. However, the lack of efficient and robust representations of the belief state (BS) space prevents them from revealing their full potential. There is a great need for automatic BS representations that replace the old hand-crafted, variable-length ones. To tackle these problems, we introduce a novel use of Autoencoders (AEs). Our goal is to obtain a low-dimensional, fixed-length, compact, yet robust representation of the BS space. We investigate the use of a dense AE, a Denoising AE (DAE), and a Variational Denoising AE (VDAE), which we combine with GP-SARSA to learn dialogue policies in the PyDial toolkit. In this framework, the BS is normally represented in a relatively compact, but still redundant, summary space obtained through a heuristic mapping of the original master space. We show that all the proposed AE-based representations consistently outperform the summary BS representation. In particular, as the Semantic Error Rate (SER) increases, the DAE/VDAE-based representations achieve state-of-the-art and sample-efficient performance.
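
A minimal PyTorch sketch of a denoising autoencoder of the kind described, mapping a fixed-length belief-state vector to a compact code that a policy learner such as GP-SARSA could consume. The layer sizes, noise level, and class name are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Compress a fixed-length belief-state vector into a low-dimensional code."""
    def __init__(self, input_dim=268, code_dim=32, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, belief_state):
        # Corrupt the input so the learned code is robust to noise (e.g. semantic errors).
        noisy = belief_state + self.noise_std * torch.randn_like(belief_state)
        code = self.encoder(noisy)
        return self.decoder(code), code

# Training minimizes reconstruction error; the code vector then replaces the
# hand-crafted summary space as input to the dialogue policy learner.
dae = DenoisingAutoencoder()
x = torch.rand(16, 268)                      # a batch of belief-state vectors
recon, code = dae(x)
loss = nn.functional.mse_loss(recon, x)
```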

Title: Joint Optimization of Classification and Clustering for Deep Speaker Embedding
Authors: Zhiming Wang, K. Yao, Shuo Fang, Xiaolong Li
DOI: 10.1109/ASRU46091.2019.9003860 (https://doi.org/10.1109/ASRU46091.2019.9003860)
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Abstract: This paper proposes a method to train deep speaker embeddings end-to-end that jointly optimizes classification and clustering. A large-margin softmax loss is used to reduce classification errors, and a novel large-margin Gaussian mixture loss is proposed to improve clustering. With the joint optimization, the learned embeddings capture segment-level acoustic representations from variable-length speech segments, discriminating between speakers and replicating the densities of speaker clusters. We compare performance with alternative methods on the large-scale text-independent speaker recognition dataset VoxCeleb1 [1] and observe that the proposed method outperforms them significantly, achieving new state-of-the-art results on the dataset. Moreover, because of the joint optimization, the method converges faster and to a better optimum than using the classification loss alone. Our results suggest great potential for the joint optimization of classification and clustering in speaker verification and identification.
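
The sketch below illustrates one way to combine a margin-based classification loss with a clustering term that pulls embeddings toward learned per-speaker centers. It approximates the idea of joint classification/clustering optimization; it is not the paper's exact large-margin Gaussian mixture formulation, and all hyperparameters are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpeakerLoss(nn.Module):
    """Illustrative joint loss: additive-margin softmax (classification)
    plus a squared distance to learned per-speaker centers (clustering)."""
    def __init__(self, embed_dim, num_speakers, margin=0.2, scale=30.0, cluster_weight=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.centers = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.margin, self.scale, self.cluster_weight = margin, scale, cluster_weight

    def forward(self, embeddings, labels):
        # Classification term: cosine similarity with an additive margin on the true class.
        logits = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        logits_m = logits - self.margin * F.one_hot(labels, logits.size(1))
        cls_loss = F.cross_entropy(self.scale * logits_m, labels)
        # Clustering term: pull each embedding toward its speaker's center.
        clu_loss = ((embeddings - self.centers[labels]) ** 2).sum(dim=1).mean()
        return cls_loss + self.cluster_weight * clu_loss

loss_fn = JointSpeakerLoss(embed_dim=256, num_speakers=1000)
emb = torch.randn(32, 256)                       # segment-level speaker embeddings
labels = torch.randint(0, 1000, (32,))
print(loss_fn(emb, labels))
```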

Title: On Temporal Context Information for Hybrid BLSTM-Based Phoneme Recognition
Authors: Timo Lohrenz, Maximilian Strake, T. Fingscheidt
DOI: 10.1109/ASRU46091.2019.9003946 (https://doi.org/10.1109/ASRU46091.2019.9003946)
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Abstract: The modern approach to including long-term temporal context information in speech recognition systems is the use of recurrent neural networks, e.g., bi-directional long short-term memory (BLSTM) networks. In this paper, we decouple the BLSTM from a preceding CNN-based feature extractor network, allowing us to investigate the use of temporal context in both models in a modular fashion. Accordingly, we train the BLSTMs on posteriors stemming from preceding CNNs that use various amounts of limited context in their input layer, and investigate to what extent the BLSTM can effectively make use of its long-term modeling capabilities. We show that it is beneficial to train the BLSTM on posteriors stemming from a temporal context-free acoustic model. Remarkably, the best-performing combination is a large-context CNN acoustic model (expected), followed by a BLSTM trained on context-free CNN output posteriors (surprising).
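
A small PyTorch sketch of the decoupled pipeline studied here: a CNN acoustic model with no temporal context in its input layer (kernel size 1, purely illustrative) emits frame-level phoneme posteriors, and a BLSTM is trained on those posterior sequences. Dimensions and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

NUM_PHONEMES, FEAT_DIM = 40, 120

# CNN acoustic model with kernel_size=1, i.e. no temporal context in its input layer.
cnn_am = nn.Sequential(
    nn.Conv1d(FEAT_DIM, 256, kernel_size=1), nn.ReLU(),
    nn.Conv1d(256, NUM_PHONEMES, kernel_size=1))

class PosteriorBLSTM(nn.Module):
    """BLSTM trained on the CNN's frame-level posteriors rather than raw features."""
    def __init__(self, num_phonemes=NUM_PHONEMES, hidden=320):
        super().__init__()
        self.blstm = nn.LSTM(num_phonemes, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_phonemes)

    def forward(self, posteriors):                    # (batch, time, num_phonemes)
        h, _ = self.blstm(posteriors)
        return self.out(h)

features = torch.rand(8, FEAT_DIM, 200)               # (batch, features, time)
with torch.no_grad():                                  # the CNN stage is trained separately
    posteriors = cnn_am(features).softmax(dim=1).transpose(1, 2)
logits = PosteriorBLSTM()(posteriors)                  # long-term context modeled here only
```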

Title: Development of Voice Spoofing Detection Systems for 2019 Edition of Automatic Speaker Verification and Countermeasures Challenge
Authors: João Monteiro, Md. Jahangir Alam
DOI: 10.1109/ASRU46091.2019.9003792 (https://doi.org/10.1109/ASRU46091.2019.9003792)
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Abstract: A robust speaker verification system is expected to provide high recognition accuracy not only in adverse environments but also in the presence of spoofing attacks, which makes voice spoofing detection crucial for protecting automatic speaker verification systems from security breaches. In this work, we present anti-spoofing systems developed to tackle the spoofing attacks introduced in the ASVspoof 2019 challenge. We employ frame-level descriptors such as the discrete Fourier transform, as well as constant-Q-transform-based spectral and cepstral features, as countermeasures. These descriptors are either used on their own with a spoofing detection classifier, or in tandem with deep bottleneck features, i.e., approximate posteriors parametrized by a neural network designed to discriminate between bona fide and spoofed signals. Fisher vector encodings and i-vector representations are further learned from the frame-level descriptors. For modeling, we employ two classification strategies. Finally, we build end-to-end anti-spoofing systems using modified versions of light convolutional neural networks as well as well-known ResNets. With our primary system for the logical access task and a single end-to-end system for the physical access task, we attain significant improvements over the two baseline systems.
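
As an illustration of one of the frame-level descriptors mentioned above, the sketch below computes simplified constant-Q-based cepstral features with librosa. It omits the uniform resampling step of true CQCC and all downstream classifiers, and the parameter values and file path are assumptions.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def cqt_cepstral_features(wav_path, n_bins=84, bins_per_octave=12, n_ceps=20):
    """Rough CQT-based cepstral features (a simplified stand-in for CQCC)."""
    y, sr = librosa.load(wav_path, sr=16000)
    cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, bins_per_octave=bins_per_octave))
    log_cqt = np.log(cqt + 1e-8)
    # DCT along the frequency axis gives cepstral-style coefficients per frame.
    return dct(log_cqt, axis=0, norm='ortho')[:n_ceps].T   # (frames, n_ceps)

# A spoofing countermeasure would then score utterance-level statistics of these
# frame-level descriptors, e.g. with a GMM back-end or an i-vector/Fisher-vector encoder.
```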

Title: Improved Multi-Stage Training of Online Attention-Based Encoder-Decoder Models
Authors: Abhinav Garg, Dhananjaya N. Gowda, Ankur Kumar, Kwangyoun Kim, Mehul Kumar, Chanwoo Kim
DOI: 10.1109/ASRU46091.2019.9003936 (https://doi.org/10.1109/ASRU46091.2019.9003936)
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Abstract: In this paper, we propose a refined multi-stage, multi-task training strategy to improve the performance of online attention-based encoder-decoder (AED) models. A three-stage training scheme based on three levels of architectural granularity, namely a character encoder, a byte pair encoding (BPE) based encoder, and an attention decoder, is proposed. In addition, multi-task learning based on two levels of linguistic granularity, character and BPE, is used. We explore different pre-training strategies for the encoders, including transfer learning from a bidirectional encoder. Our encoder-decoder models with online attention show ~35% and ~10% relative improvement over their baselines for the smaller and bigger models, respectively. Our models achieve word error rates (WER) of 5.04% and 4.48% on the Librispeech test-clean data for the smaller and bigger models, respectively, after fusion with a long short-term memory (LSTM) based external language model (LM).
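
A rough PyTorch sketch of two ingredients described above: multi-task output heads at character and BPE granularity over a shared encoder, and transfer of pre-trained encoder weights between training stages. It omits the online attention decoder, and all sizes, vocabularies, and names are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskEncoder(nn.Module):
    """Shared encoder with character-level and BPE-level output heads."""
    def __init__(self, feat_dim=80, hidden=512, n_chars=30, n_bpe=1000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4, batch_first=True)
        self.char_head = nn.Linear(hidden, n_chars)
        self.bpe_head = nn.Linear(hidden, n_bpe)

    def forward(self, feats):
        enc, _ = self.encoder(feats)
        return self.char_head(enc), self.bpe_head(enc)

# Stage 1: a model is pre-trained with character targets (stand-in below).
stage1 = MultiTaskEncoder()

# Stage 2: a new model reuses the stage-1 encoder weights before BPE-level training.
stage2 = MultiTaskEncoder()
stage2.encoder.load_state_dict(stage1.encoder.state_dict())

feats = torch.rand(4, 300, 80)
char_logits, bpe_logits = stage2(feats)
# total_loss = bpe_loss(bpe_logits, bpe_targets) + lam * char_loss(char_logits, char_targets)
```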

Title: WaveNet Factorization with Singular Value Decomposition for Voice Conversion
Authors: Hongqiang Du, Xiaohai Tian, Lei Xie, Haizhou Li
DOI: 10.1109/ASRU46091.2019.9003801 (https://doi.org/10.1109/ASRU46091.2019.9003801)
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Abstract: The WaveNet vocoder has shown a great advantage over traditional vocoders in terms of voice quality. However, it usually requires a relatively large amount of speech data to train a speaker-dependent WaveNet vocoder. It therefore remains a challenge to build a high-quality WaveNet vocoder for low-resource tasks such as voice conversion, where speech samples are limited in real applications. We propose to use singular value decomposition (SVD) to reduce the number of WaveNet parameters while maintaining output voice quality. Specifically, we apply SVD to the dilated convolution layers and impose a semi-orthogonal constraint to improve performance. Experiments conducted on the CMU-ARCTIC database show that, compared with the original WaveNet vocoder, the proposed method maintains similar performance in terms of both quality and similarity while using much less training data.
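
The core parameter-reduction step can be sketched as a truncated SVD of a dilated convolution's weights, splitting it into a low-rank dilated convolution followed by a 1x1 convolution. This PyTorch sketch omits the semi-orthogonal constraint and any fine-tuning, and the rank and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

def svd_factorize_conv1d(conv: nn.Conv1d, rank: int) -> nn.Sequential:
    """Replace a (dilated) Conv1d by a low-rank pair: a rank-`rank` dilated conv
    followed by a 1x1 conv, obtained from a truncated SVD of the original weights."""
    out_ch, in_ch, k = conv.weight.shape
    W = conv.weight.detach().reshape(out_ch, in_ch * k)          # (out, in*k)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank]                   # truncate to the given rank

    bottom = nn.Conv1d(in_ch, rank, k, dilation=conv.dilation,
                       padding=conv.padding, bias=False)
    top = nn.Conv1d(rank, out_ch, 1, bias=conv.bias is not None)
    bottom.weight.data = Vh.reshape(rank, in_ch, k)
    top.weight.data = (U * S).reshape(out_ch, rank, 1)
    if conv.bias is not None:
        top.bias.data = conv.bias.detach().clone()
    return nn.Sequential(bottom, top)

# Example: factorize one dilated convolution of a WaveNet-style stack.
layer = nn.Conv1d(64, 128, kernel_size=2, dilation=4)
factored = svd_factorize_conv1d(layer, rank=16)
x = torch.rand(1, 64, 1000)
print(layer(x).shape, factored(x).shape)   # same output shape, far fewer parameters
```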

Title: Generalized Large-Context Language Models Based on Forward-Backward Hierarchical Recurrent Encoder-Decoder Models
Authors: Ryo Masumura, Mana Ihori, Tomohiro Tanaka, Itsumi Saito, Kyosuke Nishida, T. Oba
DOI: 10.1109/ASRU46091.2019.9003857 (https://doi.org/10.1109/ASRU46091.2019.9003857)
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Abstract: This paper presents a generalized form of large-context language models (LCLMs) that can take linguistic contexts beyond utterance boundaries into consideration. In discourse-level and conversation-level automatic speech recognition (ASR) tasks, which have to handle a series of utterances, it is essential to capture long-range linguistic contexts beyond utterance boundaries. The LCLMs of previous studies mainly focused on utilizing past contexts, and none fully utilized future contexts because LMs typically process words in a time-ordered manner. Our key idea is to introduce the LCLMs into the situation where ASR results of the whole series of utterances are given by a first decoding pass. This situation makes it possible for the LCLMs to leverage future contexts. In this paper, we propose generalized LCLMs (GLCLMs) based on forward-backward hierarchical recurrent encoder-decoder models in which generative probabilities of individual utterances are computed by leveraging not only past contexts but also future contexts beyond utterance boundaries. In order to efficiently introduce GLCLMs to ASR, we also propose a global-context iterative rescoring method that repeatedly rescores the ASR hypotheses of an individual utterance by using surrounding ASR hypotheses. Experiments on discourse-level ASR tasks demonstrate the effectiveness of our GLCLM approach.
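
A schematic Python sketch of the global-context iterative rescoring loop described above: each utterance's first-pass hypotheses are repeatedly rescored given the current hypotheses of the surrounding utterances. The scoring function stands in for the GLCLM and is a placeholder, as is the toy example at the end.

```python
def rescore_conversation(nbest_lists, score_with_context, num_iters=3):
    """Iteratively rescore each utterance's n-best list using past and future context."""
    # Initialize with the first-pass 1-best hypothesis of every utterance.
    current = [nbest[0] for nbest in nbest_lists]
    for _ in range(num_iters):
        for i, nbest in enumerate(nbest_lists):
            past, future = current[:i], current[i + 1:]
            # Pick the candidate the large-context LM prefers given both contexts.
            current[i] = max(nbest, key=lambda hyp: score_with_context(hyp, past, future))
    return current

# Trivial stand-in scorer: candidates sharing more words with neighbouring hypotheses win.
def toy_scorer(hyp, past, future):
    context_words = set(w for utt in past + future for w in utt.split())
    return sum(w in context_words for w in hyp.split())

nbest_lists = [["good morning", "good mourning"], ["morning to you", "mourning to you"]]
print(rescore_conversation(nbest_lists, toy_scorer))
```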

Title: Investigation of Shallow Wavenet Vocoder with Laplacian Distribution Output
Authors: Patrick Lumban Tobing, Tomoki Hayashi, T. Toda
DOI: 10.1109/ASRU46091.2019.9003800 (https://doi.org/10.1109/ASRU46091.2019.9003800)
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Abstract: In this paper, an investigation of a shallow architecture and a Laplacian distribution output for a WaveNet vocoder trained with limited data is presented. A shallower WaveNet architecture is proposed as a better fit for limited-data use cases and to reduce computation time. To further improve the modeling capability of the WaveNet vocoder, a Laplacian distribution output is proposed. The Laplacian distribution is inherently sparse, with a higher peak and a fatter tail than the Gaussian, which may be more suitable for speech signal modeling. The experimental results demonstrate that: 1) the proposed shallow variant of the WaveNet architecture gives performance comparable to the deep one with softmax output, while reducing the computation time by 73%; and 2) the use of the Laplacian distribution output consistently improves speech quality across various amounts of limited training data, reaching a value of 4.22 for the two highest mean opinion scores.
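
A minimal PyTorch sketch of the training criterion implied by a Laplacian output layer: the network predicts a location and a log-scale per waveform sample, and the loss is the Laplacian negative log-likelihood. The shapes and values below are illustrative, not from the paper.

```python
import torch

def laplacian_nll(mu, log_b, target):
    """Negative log-likelihood of target samples under a Laplacian with location `mu`
    and scale exp(`log_b`), both predicted per time step by the vocoder network."""
    b = torch.exp(log_b)
    return (log_b + torch.log(torch.tensor(2.0)) + torch.abs(target - mu) / b).mean()

# The final layer emits two values per sample (mu, log_b) instead of a 256-way softmax
# over quantized amplitudes; at synthesis time each sample is drawn from Laplace(mu, b).
mu = torch.zeros(4, 16000)
log_b = torch.full((4, 16000), -2.0)
target = 0.1 * torch.randn(4, 16000)
print(laplacian_nll(mu, log_b, target))
```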

Title: Data Augmentation Based on Vowel Stretch for Improving Children's Speech Recognition
Authors: Tohru Nagano, Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata
DOI: 10.1109/ASRU46091.2019.9003741 (https://doi.org/10.1109/ASRU46091.2019.9003741)
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
Abstract: Prolongation is a speech disfluency that lengthens some portions of speech utterances. It is frequently observed in children's spontaneous speech, while it is rare in read speech. Making acoustic models more robust to children's spontaneous speech usually requires collecting a large amount of children's speech data containing prolongation, which is impractical in many cases. To tackle this problem, we propose a novel data augmentation method that virtually generates additional data by simulating prolongation. The method inserts pseudo frames at specific positions of speech utterances to simulate prolongation, with the acoustic features of the inserted frames calculated from the original frames on both sides. This is based on our analysis that many vowels are actually stretched in children's spontaneous speech. Our procedure can generate partially stretched utterances at low computational cost, unlike conventional speed or tempo perturbation methods that extend or shrink entire utterances at a uniform rate. The effectiveness of the proposed method was confirmed in acoustic model adaptation experiments, in which our vowel-stretch method showed consistent improvement over the conventional speed and tempo perturbation approaches.
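
A small NumPy sketch of the augmentation idea: pseudo frames interpolated from their neighbours are inserted at chosen positions to simulate locally prolonged speech. The positions and repeat counts here are hypothetical placeholders; the paper derives them from vowel locations in the utterance.

```python
import numpy as np

def stretch_frames(features, positions, repeats=2):
    """Insert pseudo frames after the given frame indices; each inserted frame is the
    average of its two neighbours, simulating a locally prolonged (stretched) segment."""
    out = []
    for t, frame in enumerate(features):
        out.append(frame)
        if t in positions and t + 1 < len(features):
            pseudo = 0.5 * (features[t] + features[t + 1])
            out.extend([pseudo] * repeats)
    return np.stack(out)

# Example: stretch the frames around two (hypothetical) vowel centres of an utterance.
feats = np.random.rand(100, 40)                 # 100 frames of 40-dim acoustic features
augmented = stretch_frames(feats, positions={30, 70}, repeats=3)
print(feats.shape, augmented.shape)             # (100, 40) (106, 40)
```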