Second Language Transfer Learning in Humans and Machines Using Image Supervision
K. Praveen, Anshul Gupta, Akshara Soman, Sriram Ganapathy
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9004011
Abstract: In the task of language learning, humans exhibit a remarkable ability to learn new words from a foreign language with very few instances of image supervision. The question, therefore, is whether such transfer learning efficiency can be simulated in machines. In this paper, we propose a deep semantic model for transfer learning words from a foreign language (Japanese) using image supervision. The proposed model is a deep audio-visual correspondence network that uses a proxy-based triplet loss. The model is trained with a large dataset of multi-modal speech/image input in the native language (English). Then, a subset of the audio network's parameters is transfer-learned to the foreign-language words using proxy vectors from the image modality. Using the proxy-based learning approach, we show that the proposed machine model achieves transfer learning performance on an image retrieval task that is comparable to human performance. We also present an analysis that contrasts the errors made by humans and machines in this task.
Improving Speech Enhancement with Phonetic Embedding Features
Bo Wu, Meng Yu, Lianwu Chen, Mingjie Jin, Dan Su, Dong Yu
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003987
Abstract: In this paper, we present a speech enhancement framework that leverages phonetic information obtained from the acoustic model. It consists of two separate components: (i) a long short-term memory recurrent neural network (LSTM-RNN) based speech enhancement model that takes the combination of log-power spectra (LPS) and phonetic embedding features as input to predict the complex ideal ratio mask (cIRM); and (ii) a convolutional, long short-term memory and fully connected deep neural network (CLDNN) based acoustic model that extracts the phonetic feature vector from the hidden units of its LSTM layer. Our experimental results show that the proposed framework outperforms both the conventional and phoneme-dependent speech enhancement systems under various noisy conditions, generalizes well to unseen conditions, and is robust to speech interference. We further demonstrate its superior enhancement performance on unvoiced speech and report a preliminary yet promising recognition experiment on real test data.
{"title":"Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models","authors":"Niko Moritz, Takaaki Hori, Jonathan Le Roux","doi":"10.1109/ASRU46091.2019.9003920","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003920","url":null,"abstract":"In this paper, we present a one-pass decoding algorithm for streaming recognition with joint connectionist temporal classification (CTC) and attention-based end-to-end automatic speech recognition (ASR) models. The decoding scheme is based on a frame-synchronous CTC prefix beam search algorithm and the recently proposed triggered attention concept. To achieve a fully streaming end-to-end ASR system, the CTC-triggered attention decoder is combined with a unidirectional encoder neural network based on parallel time-delayed long short-term memory (PTDLSTM) streams, which has demonstrated superior performance compared to various other streaming encoder architectures in earlier work. A new type of pre-training method is studied to further improve our streaming ASR models by adding residual connections to the encoder neural network and layer-wise removing them during the training process. The proposed joint CTC-triggered attention decoding algorithm, which enables streaming recognition of attention-based ASR systems, achieves similar ASR results compared to offline CTC-attention decoding and significantly better results compared to CTC prefix beam search decoding alone.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133811397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Pronunciation Lexicons for Cross-Language Acoustic Model Transfer
Matthew Wiesner, Oliver Adams, David Yarowsky, J. Trmal, S. Khudanpur
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9004019
Abstract: Existing acoustic models can be transferred to any language with a pronunciation lexicon (lexicon) that uses the same set of sub-word units as in training. Unfortunately such lexicons are not readily available in many low-resource languages. We bypass this requirement and create lexicons by training a grapheme-to-phoneme (G2P) transducer on a subset of words from other languages for which pronunciations are available. The subset of words is selected based on how representative it is of target language text. We find that cross-language acoustic model transfer using our selection strategy outperforms selection based on language similarity, and results in ASR performance approaching that of hand-crafted rule based lexicons in the majority of cases.
Incorporating Prior Knowledge into Speaker Diarization and Linking for Identifying Common Speaker
Tsun-Yat Leung, Lahiru Samarakoon, Albert Y. S. Lam
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003731
Abstract: Speaker diarization and linking discovers "who spoke when" across recordings without any speaker enrollment. Diarization is performed on each recording separately, and the linking combines clusters of the same speaker across recordings. This is a two-step approach; however, it suffers from propagating errors from the diarization step to the linking step. In a situation where a unique speaker appears in a given set of recordings, this paper aims at locating the common speaker using the prior knowledge of his or her existence. That means there is no enrollment data for this common speaker. We propose a Pairwise Common Speaker Identification (PCSI) method that takes the existence of a common speaker into account, in contrast to the two-step approach. We further show that PCSI can be used to reduce the errors introduced in the diarization step of the two-step approach. Our experiments are performed on a corpus synthesised from the AMI corpus and also on an in-house conversational telephony Sichuanese corpus that is mixed with Mandarin. We show up to 7.68% relative improvement in time-weighted equal error rate over a state-of-the-art x-vector diarization and linking system.
Spoken Language Identification Using Bidirectional LSTM Based LID Sequential Senones
H. Muralikrishna, P. Sapra, Anuksha Jain, Dileep Aroor Dinesh
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003947
Abstract: The effectiveness of the features used to represent speech utterances influences the performance of spoken language identification (LID) systems. Recent LID systems use bottleneck features (BNFs) obtained from deep neural networks (DNNs) to represent the utterances. These BNFs do not encode language-specific features. Recent advances in DNNs have led to the use of effective language-sensitive features such as LID-senones, obtained using a convolutional neural network (CNN) based architecture. In this work, we propose a novel approach to obtain LID-senones. The proposed approach combines BNFs with bidirectional long short-term memory (BLSTM) networks to generate LID-senones. Since these LID-senones preserve sequence information, we term them LID-sequential-senones (LID-seq-senones). The proposed LID-seq-senones are then used for LID in two ways. In the first approach, we propose to build an end-to-end structure with a BLSTM as the front-end LID-seq-senone extractor followed by a fully connected classification layer. In the second approach, we consider each utterance as a sequence of LID-seq-senones and propose to use a support vector machine (SVM) with a sequence kernel (GMM-based segment-level pyramid match kernel) to classify the utterance. The effectiveness of the proposed representation is evaluated on the Oregon Graduate Institute multi-language telephone speech corpus (OGI-TS) and the IIT Madras Indian language corpus (IITM-IL).
{"title":"Joint Distribution Learning in the Framework of Variational Autoencoders for Far-Field Speech Enhancement","authors":"Mahesh K. Chelimilla, Shashi Kumar, S. Rath","doi":"10.1109/ASRU46091.2019.9004024","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004024","url":null,"abstract":"Far-field speech recognition is a challenging task as speech recognizers trained on close-talk speech do not generalize well to far-field speech. In order to handle such issues, neural network based speech enhancement is typically applied using denoising autoencoder (DA). Recently generative models have become more popular particularly in the field of image generation and translation. One of the popular techniques in this generative framework is variational autoencoder (VAE). In this paper we consider VAE for speech enhancement task in the context of automatic speech recognition (ASR). We propose a novel modification in the conventional VAE to model joint distribution of the far-field and close-talk features for a common latent space representation, which we refer to as joint-VAE. Unlike conventional VAE, joint-VAE involves one encoder network that projects the far-field features onto a latent space and two decoder networks that generate close-talk and far-field features separately. Experiments conducted on the AMI corpus show that it gives a relative WER improvement of 9% compared to conventional DA and a relative improvement of 19.2% compared to mismatched train and test scenario.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123951197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Batch Normalization Adaptation for Automatic Speech Recognition","authors":"F. Mana, F. Weninger, R. Gemello, P. Zhan","doi":"10.1109/ASRU46091.2019.9003883","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003883","url":null,"abstract":"Deep Neural Network (DNN) acoustic models are sensitive to the mismatch between training and testing environments. When a trained model is tested on unseen speakers, domain, or environment, recognition accuracy can degrade substantially. In such a case, offline adaptation with a fair amount of field data can improve recognition accuracy significantly, and is commonly applied to ASR systems in practice. Ideally, such kind of adaptation should be done online as well in order to catch any unexpected dynamic changes in the environments during the inference process. However, online adaptation is subject to strict constraints on computational cost. On the other hand, the small amount of available data and the nature of unsupervised adaptation make online adaptation a very challenging task, especially for DNN acoustic models which normally contain millions of parameters. In this paper, we introduce a simple and effective online adaptation technique to compensate training and testing mismatch for DNN acoustic models. It is done via online adaptation of the parameters associated with the batch normalization applied to the model training process. Our results show that this technique can improve accuracy significantly in a domain mismatched scenario for different DNN architectures.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124331738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Code-Switching ASR and TTS with Multilingual Machine Speech Chain
Sahoko Nakayama, Andros Tjandra, S. Sakti, Satoshi Nakamura
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003926
Abstract: Constructing automatic speech recognition (ASR) and text-to-speech (TTS) for code-switching in a supervised fashion poses a challenge, since a large amount of code-switching speech and the corresponding transcriptions are usually unavailable. The machine speech chain mechanism can be utilized to achieve semi-supervised learning. The framework enables ASR and TTS to assist each other when they receive unpaired data, since it allows them to infer the missing pair and optimize the models with a reconstruction loss. In this study, we handle multiple language pairs of code-switching by integrating language embeddings into the machine speech chain and investigate whether the model can perform with code-switching language pairs that are never explicitly seen during training. Experimental results reveal that the proposed approach improves the performance of the multilingual code-switching language pairs on which the model was trained and can also perform with unknown code-switching language pairs without directly training on them.
Paraphrase Generation Based on VAE and Pointer-Generator Networks
Lohith Ravuru, Hyungtak Choi, M. SiddarthK., Hojung Lee, Inchul Hwang
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003874
Abstract: Paraphrase generation is a challenging task that involves expressing the meaning of a sentence using synonyms or different phrases, either to achieve variation or a certain stylistic response. Most previous sequence-to-sequence (Seq2Seq) models focus on either generating variations or preserving the content. We mainly address the issue of preserving the content of a sentence while generating diverse paraphrases. In this paper, we propose a novel approach for paraphrase generation using a variational autoencoder (VAE) and a Pointer-Generator Network (PGN). The proposed model uses a copy mechanism to control content transfer, a VAE to introduce variation, and a training technique that restricts the gradient flow for efficient learning. Our evaluations on the QUORA and MS COCO datasets show that our model outperforms state-of-the-art approaches and that the generated paraphrases are highly diverse as well as consistent with their original meaning.