{"title":"Exploring Effective Data Augmentation with TDNN-LSTM Neural Network Embedding for Speaker Recognition","authors":"Chien-Lin Huang","doi":"10.1109/ASRU46091.2019.9003938","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003938","url":null,"abstract":"The speaker characterization using four different data augmentation methods and time delay neural networks and long short-term memory neural networks (TDNN-LSTM) is proposed in this paper. The proposed data augmentation is used to increase the amount and diversity of the training data including adding speed perturbation, adding volume perturbation, adding room impulse responses, and adding additive noises. The idea of TDNN-LSTM based speaker embedding is better to capture the temporal information in speaker speech than the conventional TDNN based x-vectors. The proposed methods were trained on VoxCeleb dataset and tested with Speakers In The Wild (SITW) dataset in the evaluation core-core condition. We achieved results of EER=1.86% and a minimum decision cost function (DCF) of 0.204 at p-target=0.01, and a minimum DCF of 0.368 at p-target=0.001. The proposed methods outperform the baselines of both i-vector and x-vector.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121463497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advances in Online Audio-Visual Meeting Transcription","authors":"Takuya Yoshioka, Igor Abramovski, Cem Aksoylar, Zhuo Chen, Moshe David, D. Dimitriadis, Y. Gong, I. Gurvich, Xuedong Huang, Yan-ping Huang, Aviv Hurvitz, Li Jiang, S. Koubi, Eyal Krupka, Ido Leichter, Changliang Liu, P. Parthasarathy, Alon Vinnikov, Lingfeng Wu, Xiong Xiao, Wayne Xiong, Huaming Wang, Zhenghao Wang, Jun Zhang, Yong Zhao, Tianyan Zhou","doi":"10.1109/ASRU46091.2019.9003827","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003827","url":null,"abstract":"This paper describes a system that generates speaker-annotated transcripts of meetings by using a microphone array and a 360-degree camera. The hallmark of the system is its ability to handle overlapped speech, which has been an unsolved problem in realistic settings for over a decade. We show that this problem can be addressed by using a continuous speech separation approach. In addition, we describe an online audio-visual speaker diarization method that leverages face tracking and identification, sound source localization, speaker identification, and, if available, prior speaker information for robustness to various real world challenges. All components are integrated in a meeting transcription framework called SRD, which stands for “separate, recognize, and diarize”. Experimental results using recordings of natural meetings involving up to 11 attendees are reported. The continuous speech separation improves a word error rate (WER) by 16.1% compared with a highly tuned beamformer. When a complete list of meeting attendees is available, the discrepancy between WER and speaker-attributed WER is only 1.0%, indicating accurate word-to-speaker association. This increases marginally to 1.6% when 50% of the attendees are unknown to the system.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121574604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language Model Bootstrapping Using Neural Machine Translation for Conversational Speech Recognition","authors":"Surabhi Punjabi, Harish Arsikere, S. Garimella","doi":"10.1109/ASRU46091.2019.9003982","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003982","url":null,"abstract":"Building conversational speech recognition systems for new languages is constrained by the availability of utterances capturing user-device interactions. Data collection is expensive and limited by speed of manual transcription. In order to address this, we advocate the use of neural machine translation as a data augmentation technique for bootstrapping language models. Machine translation (MT) offers a systematic way of incorporating collections from mature, resource-rich conversational systems that may be available for a different language. However, ingesting raw translations from a general purpose MT system may not be effective owing to the presence of named entities, intra sentential code-switching and the domain mismatch between the conversational data being translated and the parallel text used for MT training. To circumvent this, we explore following domain adaptation techniques: (a) sentence embedding based data selection for MT training, (b) model finetuning, and (c) rescoring and filtering translated hypotheses. Using Hindi language as the experimental testbed, we supplement transcribed collections with translated US English utterances. We observe a relative word error rate reduction of 7.8-15.6%, depending on the bootstrapping phase. Fine grained analysis reveals that translation particularly aids the interaction scenarios underrepresented in the transcribed data.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"198 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124439660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tacotron-Based Acoustic Model Using Phoneme Alignment for Practical Neural Text-to-Speech Systems","authors":"T. Okamoto, T. Toda, Y. Shiga, H. Kawai","doi":"10.1109/ASRU46091.2019.9003956","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003956","url":null,"abstract":"Although sequence-to-sequence (seq2seq) models with attention mechanism in neural text-to-speech (TTS) systems, such as Tacotron 2, can jointly optimize duration and acoustic models, and realize high-fidelity synthesis compared with conventional duration-acoustic pipeline models, these involve a risk that speech samples cannot be sometimes successfully synthesized due to the attention prediction errors. Therefore, these seq2seq models cannot be directly introduced in practical TTS systems. On the other hand, the conventional pipeline models are broadly used in practical TTS systems since there are few crucial prediction errors in the duration model. For realizing high-quality practical TTS systems without attention prediction errors, this paper investigates Tacotron-based acoustic models with phoneme alignment instead of attention. The phoneme durations are first obtained from HMM-based forced alignment and the duration model is a simple bidirectional LSTM-based network. Then, a seq2seq model with forced alignment instead of attention is investigated and an alternative model with Tacotron decoder and phoneme duration is proposed. The results of experiments with full-context label input using WaveGlow vocoder indicate that the proposed model can realize a high-fidelity TTS system for Japanese with a real-time factor of 0.13 using a GPU without attention prediction errors compared with the seq2seq models.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121544140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian Adversarial Learning for Speaker Recognition","authors":"Jen-Tzung Chien, Chun Lin Kuo","doi":"10.1109/ASRU46091.2019.9004033","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004033","url":null,"abstract":"This paper presents a new generative adversarial network (GAN) which artificially generates the i-vectors to compensate the imbalanced or insufficient data in speaker recognition based on the probabilistic linear discriminant analysis. Theoretically, GAN is powerful to generate the artificial data which are misclassified as the real data. However, GAN suffers from the mode collapse problem in two-player optimization over generator and discriminator. This study deals with this challenge by improving the model regularization through characterizing the weight uncertainty in GAN. A new Bayesian GAN is implemented to learn a regularized model from diverse data where the strong modes are flattened via the marginalization. In particular, we present a variational GAN (VGAN) where the encoder, generator and discriminator are jointly estimated according to the variational inference. The computation cost is significantly reduced. To assure the preservation of gradient values, the learning objective based on Wasserstein distance is further introduced. The issues of model collapse and gradient vanishing are alleviated. Experiments on NIST i-vector Speaker Recognition Challenge demonstrate the superiority of the proposed VGAN to the variational autoencoder, the standard GAN and the Bayesian GAN based on the sampling method. The learning efficiency and generation performance are evaluated.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116907060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Machine Translation with Acoustic Embedding","authors":"Takatomo Kano, S. Sakti, Satoshi Nakamura","doi":"10.1109/ASRU46091.2019.9003802","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003802","url":null,"abstract":"Neural machine translation (NMT) has successfully redefined the state of the art in machine translation on several language pairs. One popular framework models the translation process end-to-end using attentional encoder-decoder architecture and treats each word in the vectors of intermediate representation. These embedding vectors are sensitive to the meaning of words and allow semantically similar words to be near each other in the vector spaces and share their statistical power. Unfortunately, the model often maps such similar words too closely, which complicates distinguishing them. Consequently, NMT systems often mistranslate words that seem natural in the context but do not reflect the content of the source sentence. Incorporating auxiliary information usually enhances the discriminability. In this research, we integrate acoustic information within NMT by multi-task learning. Here, our model learns how to embed and translate word sequences based on their acoustic and semantic differences by helping it choose the correct output word based on its meaning and pronunciation. Our experiment results show that our proposed approach provides more significant improvement than the standard text-based transformer NMT model in BLEU score evaluation.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129259763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dialogue Environments are Different from Games: Investigating Variants of Deep Q-Networks for Dialogue Policy","authors":"Yu-An Wang, Yun-Nung (Vivian) Chen","doi":"10.1109/ASRU46091.2019.9003840","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003840","url":null,"abstract":"The dialogue manager is an important component in a task-oriented dialogue system, which focuses on deciding dialogue policy given the dialogue state in order to fulfill the user goal. Learning dialogue policy is usually framed as a reinforcement learning (RL) problem, where the objective is to maximize the reward indicating whether the conversation is successful and how efficient it is. However, even there are many variants of deep Q-networks (DQN) achieving better performance on game playing scenarios, no prior work analyzed the performance of dialogue policy learning using these improved versions. Considering that dialogue interactions differ a lot from game playing, this paper investigates variants of DQN models together with different exploration strategies in a benchmark experimental setup, and then we examine which RL methods are more suitable for task-completion dialogue policy learning11The code is available at https://github.com/MiuLab/Dialogue-DQN-Variants.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130501068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Title","authors":"D. Baca, Kai Petersen","doi":"10.1109/asru46091.2019.9003789","DOIUrl":"https://doi.org/10.1109/asru46091.2019.9003789","url":null,"abstract":"Software security is an important quality aspect of a software system. Therefore, it is important to integrate software security touch points throughout the development life-cycle. So far, the focus of touch points in the early phases has been on the identification of threats and attacks. In this paper we propose a novel method focusing on the end product by prioritizing countermeasures. The method provides an extension to attack trees and a process for identification and prioritization of countermeasures. The approach has been applied on an opensource application and showed that countermeasures could be identified. Furthermore, an analysis of the effectiveness and cost-efficiency of the countermeasures could be provided.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131241593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CNN with Phonetic Attention for Text-Independent Speaker Verification","authors":"Tianyan Zhou, Yong Zhao, Jinyu Li, Y. Gong, Jian Wu","doi":"10.1109/ASRU46091.2019.9003826","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003826","url":null,"abstract":"Text-independent speaker verification imposes no constraints on the spoken content and usually needs long observations to make reliable prediction. In this paper, we propose two speaker embedding approaches by integrating the phonetic information into the attention-based residual convolutional neural network (CNN). Phonetic features are extracted from the bottleneck layer of a pretrained acoustic model. In implicit phonetic attention (IPA), the phonetic features are projected by a transformation network into multi-channel feature maps, and then combined with the raw acoustic features as the input of the CNN network. In explicit phonetic attention (EPA), the phonetic features are directly connected to the attentive pooling layer through a separate 1-dim CNN to generate the attention weights. With the incorporation of spoken content and attention mechanism, the system can not only distill the speaker-discriminant frames but also actively normalize the phonetic variations. Multi-head attention and discriminative objectives are further studied to improve the system. Experiments on the VoxCeleb corpus show our proposed system could outperform the state-of-the-art by around 43% relative.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"182 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124604525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explicit Alignment of Text and Speech Encodings for Attention-Based End-to-End Speech Recognition","authors":"Jennifer Drexler, James R. Glass","doi":"10.1109/ASRU46091.2019.9003873","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003873","url":null,"abstract":"In this work, we present a novel training procedure for attention-based end-to-end automatic speech recognition. Our goal is to push the encoder network to output only linguistic information, improving generalization performance particularly in low-resource scenarios. We accomplish this with the addition of a text encoder network, which the speech encoder is encouraged to mimic. Our main innovation is the comparison of the attention-weighted speech encoder outputs to the outputs of the text encoder - this guarantees two sequences of the same length that can be directly aligned. We show that our training procedure significantly decreases word error rates in all experiments and has the biggest absolute impact in the lowest resource scenarios.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114405189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}