{"title":"Speaker and Language Aware Training for End-to-End ASR","authors":"Shubham Bansal, Karan Malhotra, Sriram Ganapathy","doi":"10.1109/ASRU46091.2019.9004000","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004000","url":null,"abstract":"The end-to-end (E2E) approach to automatic speech recognition (ASR) is a simplified and an elegant approach where a single deep neural network model directly converts the acoustic feature sequence to the text sequence. The current approach to end-to-end ASR uses the neural network model (trained with sequence loss) along with an external character/word based language model (LM) in a decoding pass to output the text sequence. In this work, we propose a new objective function for end-to-end ASR training where the LM score is explicitly introduced in the attention model loss function without any additional training parameters. In this manner, the neural network is made LM aware and this simplifies the model training process. We also propose to incorporate an attention based sequence summary feature in the ASR model which allows the system to be speaker aware. With several E2E ASR experiments on TED-LIUM, WSJ and Librispeech datasets, we show that the proposed speaker and LM aware training improves the ASR performance significantly over the state-of-art E2E approaches. We achieve the best published results reported for WSJ dataset.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127698194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention-Based Speech Recognition Using Gaze Information","authors":"Osamu Segawa, Tomoki Hayashi, K. Takeda","doi":"10.1109/ASRU46091.2019.9004030","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004030","url":null,"abstract":"We assume that there is a correlation between an utterance and a corresponding gaze object, and propose a new paradigm of multi-modal end-to-end speech recognition using multimodal information, namely, utterances and corresponding gaze points. In our method, the system extracts acoustic features and corresponding images around gaze points, and inputs the information into the proposed attention-based multiple encoder-decoder networks. This makes it possible to integrate the two different modalities, and the performance of speech recognition is improved. To evaluate the proposed method, we prepared a simulation task of power-line control operations, and built a corpus that contains utterances and corresponding gaze points in the operations. We conducted an experimental evaluation using this corpus, and the results showed the reduction in the CER, suggesting the effectiveness of the proposed method in which acoustic features and gaze information are integrated.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124358796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Syllable-Dependent Discriminative Learning for Small Footprint Text-Dependent Speaker Verification","authors":"Junyi Peng, Yuexian Zou, N. Li, Deyi Tuo, Dan Su, Meng Yu, Chunlei Zhang, Dong Yu","doi":"10.1109/ASRU46091.2019.9004023","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004023","url":null,"abstract":"This study proposes a novel scheme of syllable-dependent discriminative speaker embedding learning for small footprint text-dependent speaker verification systems. To suppress undesired syllable variation and enhance the power of discrimination inherited in the frame-level features, we design a novel syllable-dependent clustering loss to optimize the network. Specifically, this loss function utilizes syllable labels as auxiliary supervision information to explicitly maximize inter-syllable divisibility and intra-syllable compactness between the learned frame-level features. Successively, we propose two syllable-dependent pooling mechanisms to aggregate the frame-level features to several syllable-level features by averaging those features corresponding to each syllable. The utterance-level speaker embeddings with powerful discrimination are then obtained by concatenating the syllable-level features. Experimental results on Tencent voice wake-up dataset show that our proposed scheme can accelerate the network convergence and achieve significant performance improvement against the state-of-the-art methods.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114333525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker Verification with Application-Aware Beamforming","authors":"Ladislav Mošner, Oldrich Plchot, Johan Rohdin, L. Burget, J. Černocký","doi":"10.1109/ASRU46091.2019.9003932","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003932","url":null,"abstract":"Multichannel speech processing applications usually employ beamformers as means of speech enhancement through spatial filtering. Beamformers with learnable parameters require training to minimize a loss function that is not necessarily correlated with the final objective. In this paper, we present a framework employing recent neural network based generalized eigenvalue beamformer and application-specific model that allows for optimization of beamformer w.r.t. target application. In our case, the application is speaker verification which utilizes a speaker embedding (x-vector) extractor that conveniently comes with desired loss. We show that application-specific training of the beamformer brings performance improvements over a system trained in the standard way. We perform our analysis on the recently introduced VOiCES corpus which contains multichannel data and allows us to modify the evaluation trials such that enrollment recordings remain single-channel and test utterances are multichannel.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132745450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Bert-Based Ranking Models for Spoken Document Retrieval","authors":"Hsiao-Yun Lin, Tien-Hong Lo, Berlin Chen","doi":"10.1109/ASRU46091.2019.9003890","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003890","url":null,"abstract":"The Bidirectional Encoder Representations from Transformers (BERT) model has recently achieved record-breaking success on many natural language processing (NLP) tasks such as question answering and language understanding. However, relatively little work has been done on ad-hoc information retrieval (IR), especially for spoken document retrieval (SDR). This paper adopts and extends BERT for SDR, while its contributions are at least three-fold. First, we augment BERT with extra language features such as unigram and inverse document frequency (IDF) statistics to make it more applicable to SDR. Second, we also explore the incorporation of confidence scores into document representations to see if they could help alleviate the negative effects resulting from imperfect automatic speech recognition (ASR). Third, we conduct a comprehensive set of experiments to compare our BERT-based ranking methods with other state-of-the-art ones and investigate the synergy effect of them as well.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134123850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controlling Emotion Strength with Relative Attribute for End-to-End Speech Synthesis","authors":"Xiaolian Zhu, Shan Yang, Geng Yang, Lei Xie","doi":"10.1109/ASRU46091.2019.9003829","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003829","url":null,"abstract":"Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional speech synthesis models, and several approaches like global style tokens are proposed to explore the style controllability of the end-to-end model. Although the existing methods show good performance in style disentanglement and transfer, it is still unable to control the explicit emotion of generated speech. In this paper, we mainly focus on the subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector and a continuous simple scalar, respectively. The continuous strength controller is learned by a ranking function according to the relative attribute measured on an emotion dataset. Our method automatically learns the relationship between low-level acoustic features and high-level subtle emotion strength. Experiments show that our method can effectively improve the controllability for an expressive end-to-end model.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123815765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Verifying Deep Keyword Spotting Detection with Acoustic Word Embeddings","authors":"Yougen Yuan, Zhiqiang Lv, Shen Huang, Lei Xie","doi":"10.1109/ASRU46091.2019.9003781","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003781","url":null,"abstract":"In this paper, in order to improve keyword spotting (KWS) performance in a live broadcast scenario, we propose to use a template matching method based on acoustic word embeddings (AWE) as the second stage to verify the detection from the Deep KWS system. AWEs are obtained via a deep bidirectional long short-term memory (BLSTM) network trained using limited positive and negative keyword candidates, which aims to encode variable-length keyword candidates into fixed-dimensional vectors with reasonable discriminative ability. Learning AWEs takes a combination of three specifically-designed losses: the triplet and reversed triplet losses try to keep same keyword candidates closer and different keyword candidates farther, while the hinge loss is to set a fixed threshold to distinguish all positive and negative keyword candidates. During keyword verification, calibration scores are used to reduce the bias between different templates for different keyword candidates. Experiments show that adding AWE-based keyword verification to Deep KWS achieves 5.6% relative accuracy improvement; the hinge loss brings additional 5.5% relative gain and the final accuracy climbs to 0.775 by using calibration scores.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134474214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Free Keyword Detection Based on CNN and End-to-End Continuous DP-Matching","authors":"Tomohiro Tanaka, T. Shinozaki","doi":"10.1109/ASRU46091.2019.9004021","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004021","url":null,"abstract":"For continuous keyword detection, the advantage of dynamic programming (DP) matching is that it can detect any keyword without re-training the system. In previous research, higher detection accuracy was reported using 2D-RNN based DP matching than using conventional DP and embedding methods. However, 2D-RNN based DP matching has a high computational cost. In order to address this problem, we combine a convolutional neural network (CNN) and 2D-RNN based DP matching into a unified framework which, based on the kernel size and the number of CNN layers, has a polynomial order effect on reducing the computational cost. Experimental results, using Google Speech Commands Dataset and the CHiME-3 challenge's noise data, demonstrate that our proposed model improves open keyword detection performance, compared to the embedding-based baseline system, while it is nine times faster than previous 2D-RNN DP matching.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"306 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114278084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GANs for Children: A Generative Data Augmentation Strategy for Children Speech Recognition","authors":"Peiyao Sheng, Zhuolin Yang, Y. Qian","doi":"10.1109/ASRU46091.2019.9003933","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003933","url":null,"abstract":"Due to the high acoustic variability, children speech recognition suffers significant performance reduction on most ASR systems which are optimized mainly using adults speech with limited or even none children speech. One of the most straight ideas to solve this problem is to increase the children's speech data during training, however, it is restricted by the more difficult process and higher cost when collecting children's speech compared to adults'. In this work, we develop a generative adversarial network (GANs) based data augmentation method to increase the size of children's training data to improve speech recognition performance for children's speech. Two different types of GANs are explored under WGAN-GP training framework, including the unconditional GANs with an unsupervised learning framework and the conditional GANs using acoustic states as conditions. The proposed data augmentation approaches are evaluated on a Mandarin speech recognition task, with only 40-hour children speech or further including 100-hour adult speech in the training. The results show that more than relative 20% WER reduction can be obtained on children speech testset with the proposed method, and the generated children speech with GAN even can improve the adults' speech within our experimental setups.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116360126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System","authors":"Chanwoo Kim, Sungsoo Kim, Kwangyoun Kim, Mehul Kumar, Jiyeon Kim, Kyungmin Lee, C. Han, Abhinav Garg, Eunhyang Kim, Minkyoo Shin, Shatrughan Singh, Larry Heck, Dhananjaya N. Gowda","doi":"10.1109/ASRU46091.2019.9003976","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003976","url":null,"abstract":"In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The entire data reading, large scale data augmentation, neural network parameter updates are all performed “on-the-fly”. We use vocal tract length perturbation [1] and an acoustic simulator [2] for data augmentation. The processed features and labels are sent to the GPU cluster. The Horovod allreduce approach is employed to train neural network parameters. We evaluated the effectiveness of our system on the standard Librispeech corpus [3] and the 10,000-hr anonymized Bixby English dataset. Our end-to-end speech recognition system built using this training infrastructure showed a 2.44 % WER on test-clean of the LibriSpeech test set after applying shallow fusion with a Transformer language model (LM). For the proprietary English Bixby open domain test set, we obtained a WER of 7.92 % using a Bidirectional Full Attention (BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the monotonic chunckwise attention (MoCha) based approach is employed for streaming speech recognition, we obtained a WER of 9.95 % on the same Bixby open domain test set.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121984254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}