{"title":"Speaker and Language Aware Training for End-to-End ASR","authors":"Shubham Bansal, Karan Malhotra, Sriram Ganapathy","doi":"10.1109/ASRU46091.2019.9004000","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004000","url":null,"abstract":"The end-to-end (E2E) approach to automatic speech recognition (ASR) is a simplified and an elegant approach where a single deep neural network model directly converts the acoustic feature sequence to the text sequence. The current approach to end-to-end ASR uses the neural network model (trained with sequence loss) along with an external character/word based language model (LM) in a decoding pass to output the text sequence. In this work, we propose a new objective function for end-to-end ASR training where the LM score is explicitly introduced in the attention model loss function without any additional training parameters. In this manner, the neural network is made LM aware and this simplifies the model training process. We also propose to incorporate an attention based sequence summary feature in the ASR model which allows the system to be speaker aware. With several E2E ASR experiments on TED-LIUM, WSJ and Librispeech datasets, we show that the proposed speaker and LM aware training improves the ASR performance significantly over the state-of-art E2E approaches. We achieve the best published results reported for WSJ dataset.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127698194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention-Based Speech Recognition Using Gaze Information","authors":"Osamu Segawa, Tomoki Hayashi, K. Takeda","doi":"10.1109/ASRU46091.2019.9004030","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004030","url":null,"abstract":"We assume that there is a correlation between an utterance and a corresponding gaze object, and propose a new paradigm of multi-modal end-to-end speech recognition using multimodal information, namely, utterances and corresponding gaze points. In our method, the system extracts acoustic features and corresponding images around gaze points, and inputs the information into the proposed attention-based multiple encoder-decoder networks. This makes it possible to integrate the two different modalities, and the performance of speech recognition is improved. To evaluate the proposed method, we prepared a simulation task of power-line control operations, and built a corpus that contains utterances and corresponding gaze points in the operations. We conducted an experimental evaluation using this corpus, and the results showed the reduction in the CER, suggesting the effectiveness of the proposed method in which acoustic features and gaze information are integrated.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124358796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Syllable-Dependent Discriminative Learning for Small Footprint Text-Dependent Speaker Verification","authors":"Junyi Peng, Yuexian Zou, N. Li, Deyi Tuo, Dan Su, Meng Yu, Chunlei Zhang, Dong Yu","doi":"10.1109/ASRU46091.2019.9004023","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004023","url":null,"abstract":"This study proposes a novel scheme of syllable-dependent discriminative speaker embedding learning for small footprint text-dependent speaker verification systems. To suppress undesired syllable variation and enhance the power of discrimination inherited in the frame-level features, we design a novel syllable-dependent clustering loss to optimize the network. Specifically, this loss function utilizes syllable labels as auxiliary supervision information to explicitly maximize inter-syllable divisibility and intra-syllable compactness between the learned frame-level features. Successively, we propose two syllable-dependent pooling mechanisms to aggregate the frame-level features to several syllable-level features by averaging those features corresponding to each syllable. The utterance-level speaker embeddings with powerful discrimination are then obtained by concatenating the syllable-level features. Experimental results on Tencent voice wake-up dataset show that our proposed scheme can accelerate the network convergence and achieve significant performance improvement against the state-of-the-art methods.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114333525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker Verification with Application-Aware Beamforming","authors":"Ladislav Mošner, Oldrich Plchot, Johan Rohdin, L. Burget, J. Černocký","doi":"10.1109/ASRU46091.2019.9003932","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003932","url":null,"abstract":"Multichannel speech processing applications usually employ beamformers as means of speech enhancement through spatial filtering. Beamformers with learnable parameters require training to minimize a loss function that is not necessarily correlated with the final objective. In this paper, we present a framework employing recent neural network based generalized eigenvalue beamformer and application-specific model that allows for optimization of beamformer w.r.t. target application. In our case, the application is speaker verification which utilizes a speaker embedding (x-vector) extractor that conveniently comes with desired loss. We show that application-specific training of the beamformer brings performance improvements over a system trained in the standard way. We perform our analysis on the recently introduced VOiCES corpus which contains multichannel data and allows us to modify the evaluation trials such that enrollment recordings remain single-channel and test utterances are multichannel.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132745450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Bert-Based Ranking Models for Spoken Document Retrieval","authors":"Hsiao-Yun Lin, Tien-Hong Lo, Berlin Chen","doi":"10.1109/ASRU46091.2019.9003890","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003890","url":null,"abstract":"The Bidirectional Encoder Representations from Transformers (BERT) model has recently achieved record-breaking success on many natural language processing (NLP) tasks such as question answering and language understanding. However, relatively little work has been done on ad-hoc information retrieval (IR), especially for spoken document retrieval (SDR). This paper adopts and extends BERT for SDR, while its contributions are at least three-fold. First, we augment BERT with extra language features such as unigram and inverse document frequency (IDF) statistics to make it more applicable to SDR. Second, we also explore the incorporation of confidence scores into document representations to see if they could help alleviate the negative effects resulting from imperfect automatic speech recognition (ASR). Third, we conduct a comprehensive set of experiments to compare our BERT-based ranking methods with other state-of-the-art ones and investigate the synergy effect of them as well.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134123850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controlling Emotion Strength with Relative Attribute for End-to-End Speech Synthesis","authors":"Xiaolian Zhu, Shan Yang, Geng Yang, Lei Xie","doi":"10.1109/ASRU46091.2019.9003829","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003829","url":null,"abstract":"Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional speech synthesis models, and several approaches like global style tokens are proposed to explore the style controllability of the end-to-end model. Although the existing methods show good performance in style disentanglement and transfer, it is still unable to control the explicit emotion of generated speech. In this paper, we mainly focus on the subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector and a continuous simple scalar, respectively. The continuous strength controller is learned by a ranking function according to the relative attribute measured on an emotion dataset. Our method automatically learns the relationship between low-level acoustic features and high-level subtle emotion strength. Experiments show that our method can effectively improve the controllability for an expressive end-to-end model.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123815765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Verifying Deep Keyword Spotting Detection with Acoustic Word Embeddings","authors":"Yougen Yuan, Zhiqiang Lv, Shen Huang, Lei Xie","doi":"10.1109/ASRU46091.2019.9003781","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003781","url":null,"abstract":"In this paper, in order to improve keyword spotting (KWS) performance in a live broadcast scenario, we propose to use a template matching method based on acoustic word embeddings (AWE) as the second stage to verify the detection from the Deep KWS system. AWEs are obtained via a deep bidirectional long short-term memory (BLSTM) network trained using limited positive and negative keyword candidates, which aims to encode variable-length keyword candidates into fixed-dimensional vectors with reasonable discriminative ability. Learning AWEs takes a combination of three specifically-designed losses: the triplet and reversed triplet losses try to keep same keyword candidates closer and different keyword candidates farther, while the hinge loss is to set a fixed threshold to distinguish all positive and negative keyword candidates. During keyword verification, calibration scores are used to reduce the bias between different templates for different keyword candidates. Experiments show that adding AWE-based keyword verification to Deep KWS achieves 5.6% relative accuracy improvement; the hinge loss brings additional 5.5% relative gain and the final accuracy climbs to 0.775 by using calibration scores.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134474214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Free Keyword Detection Based on CNN and End-to-End Continuous DP-Matching","authors":"Tomohiro Tanaka, T. Shinozaki","doi":"10.1109/ASRU46091.2019.9004021","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004021","url":null,"abstract":"For continuous keyword detection, the advantage of dynamic programming (DP) matching is that it can detect any keyword without re-training the system. In previous research, higher detection accuracy was reported using 2D-RNN based DP matching than using conventional DP and embedding methods. However, 2D-RNN based DP matching has a high computational cost. In order to address this problem, we combine a convolutional neural network (CNN) and 2D-RNN based DP matching into a unified framework which, based on the kernel size and the number of CNN layers, has a polynomial order effect on reducing the computational cost. Experimental results, using Google Speech Commands Dataset and the CHiME-3 challenge's noise data, demonstrate that our proposed model improves open keyword detection performance, compared to the embedding-based baseline system, while it is nine times faster than previous 2D-RNN DP matching.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"306 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114278084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GANs for Children: A Generative Data Augmentation Strategy for Children Speech Recognition","authors":"Peiyao Sheng, Zhuolin Yang, Y. Qian","doi":"10.1109/ASRU46091.2019.9003933","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003933","url":null,"abstract":"Due to the high acoustic variability, children speech recognition suffers significant performance reduction on most ASR systems which are optimized mainly using adults speech with limited or even none children speech. One of the most straight ideas to solve this problem is to increase the children's speech data during training, however, it is restricted by the more difficult process and higher cost when collecting children's speech compared to adults'. In this work, we develop a generative adversarial network (GANs) based data augmentation method to increase the size of children's training data to improve speech recognition performance for children's speech. Two different types of GANs are explored under WGAN-GP training framework, including the unconditional GANs with an unsupervised learning framework and the conditional GANs using acoustic states as conditions. The proposed data augmentation approaches are evaluated on a Mandarin speech recognition task, with only 40-hour children speech or further including 100-hour adult speech in the training. The results show that more than relative 20% WER reduction can be obtained on children speech testset with the proposed method, and the generated children speech with GAN even can improve the adults' speech within our experimental setups.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116360126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System","authors":"Chanwoo Kim, Sungsoo Kim, Kwangyoun Kim, Mehul Kumar, Jiyeon Kim, Kyungmin Lee, C. Han, Abhinav Garg, Eunhyang Kim, Minkyoo Shin, Shatrughan Singh, Larry Heck, Dhananjaya N. Gowda","doi":"10.1109/ASRU46091.2019.9003976","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003976","url":null,"abstract":"In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The entire data reading, large scale data augmentation, neural network parameter updates are all performed “on-the-fly”. We use vocal tract length perturbation [1] and an acoustic simulator [2] for data augmentation. The processed features and labels are sent to the GPU cluster. The Horovod allreduce approach is employed to train neural network parameters. We evaluated the effectiveness of our system on the standard Librispeech corpus [3] and the 10,000-hr anonymized Bixby English dataset. Our end-to-end speech recognition system built using this training infrastructure showed a 2.44 % WER on test-clean of the LibriSpeech test set after applying shallow fusion with a Transformer language model (LM). For the proprietary English Bixby open domain test set, we obtained a WER of 7.92 % using a Bidirectional Full Attention (BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the monotonic chunckwise attention (MoCha) based approach is employed for streaming speech recognition, we obtained a WER of 9.95 % on the same Bixby open domain test set.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121984254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}