Second Language Transfer Learning in Humans and Machines Using Image Supervision
K. Praveen, Anshul Gupta, Akshara Soman, Sriram Ganapathy
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9004011
Abstract: In the task of language learning, humans exhibit a remarkable ability to learn new words from a foreign language with very few instances of image supervision. The question, therefore, is whether such transfer learning efficiency can be simulated in machines. In this paper, we propose a deep semantic model for transfer learning words from a foreign language (Japanese) using image supervision. The proposed model is a deep audio-visual correspondence network that uses a proxy-based triplet loss. The model is trained with a large dataset of multi-modal speech/image input in the native language (English). Then, a subset of the audio network's parameters is transfer-learned to the foreign-language words using proxy vectors from the image modality. Using the proxy-based learning approach, we show that the proposed machine model achieves transfer learning performance on an image retrieval task that is comparable to human performance. We also present an analysis that contrasts the errors made by humans and machines in this task.
Improving Speech Enhancement with Phonetic Embedding Features
Bo Wu, Meng Yu, Lianwu Chen, Mingjie Jin, Dan Su, Dong Yu
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003987
Abstract: In this paper, we present a speech enhancement framework that leverages phonetic information obtained from the acoustic model. It consists of two separate components: (i) a long short-term memory recurrent neural network (LSTM-RNN) based speech enhancement model that takes the combination of log-power spectra (LPS) and phonetic embedding features as input to predict the complex ideal ratio mask (cIRM); and (ii) a convolutional, long short-term memory and fully connected deep neural network (CLDNN) based acoustic model that extracts the phonetic feature vector from the hidden units of its LSTM layer. Our experimental results show that the proposed framework outperforms both the conventional and phoneme-dependent speech enhancement systems under various noisy conditions, generalizes well to unseen conditions, and is robust to speech interference. We further demonstrate its superior enhancement performance on unvoiced speech and report a preliminary yet promising recognition experiment on real test data.
{"title":"Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models","authors":"Niko Moritz, Takaaki Hori, Jonathan Le Roux","doi":"10.1109/ASRU46091.2019.9003920","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003920","url":null,"abstract":"In this paper, we present a one-pass decoding algorithm for streaming recognition with joint connectionist temporal classification (CTC) and attention-based end-to-end automatic speech recognition (ASR) models. The decoding scheme is based on a frame-synchronous CTC prefix beam search algorithm and the recently proposed triggered attention concept. To achieve a fully streaming end-to-end ASR system, the CTC-triggered attention decoder is combined with a unidirectional encoder neural network based on parallel time-delayed long short-term memory (PTDLSTM) streams, which has demonstrated superior performance compared to various other streaming encoder architectures in earlier work. A new type of pre-training method is studied to further improve our streaming ASR models by adding residual connections to the encoder neural network and layer-wise removing them during the training process. The proposed joint CTC-triggered attention decoding algorithm, which enables streaming recognition of attention-based ASR systems, achieves similar ASR results compared to offline CTC-attention decoding and significantly better results compared to CTC prefix beam search decoding alone.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133811397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Pronunciation Lexicons for Cross-Language Acoustic Model Transfer
Matthew Wiesner, Oliver Adams, David Yarowsky, J. Trmal, S. Khudanpur
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9004019
Abstract: Existing acoustic models can be transferred to any language with a pronunciation lexicon (lexicon) that uses the same set of sub-word units as in training. Unfortunately such lexicons are not readily available in many low-resource languages. We bypass this requirement and create lexicons by training a grapheme-to-phoneme (G2P) transducer on a subset of words from other languages for which pronunciations are available. The subset of words is selected based on how representative it is of target language text. We find that cross-language acoustic model transfer using our selection strategy outperforms selection based on language similarity, and results in ASR performance approaching that of hand-crafted rule based lexicons in the majority of cases.
Incorporating Prior Knowledge into Speaker Diarization and Linking for Identifying Common Speaker
Tsun-Yat Leung, Lahiru Samarakoon, Albert Y. S. Lam
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003731
Abstract: Speaker diarization and linking discovers "who spoke when" across recordings without any speaker enrollment. Diarization is performed on each recording separately, and the linking combines clusters of the same speaker across recordings. This is a two-step approach; however, it suffers from propagating errors from the diarization step to the linking step. In a situation where a unique speaker appears in a given set of recordings, this paper aims at locating the common speaker using the prior knowledge of his or her existence. That means there is no enrollment data for this common speaker. We propose a Pairwise Common Speaker Identification (PCSI) method that takes the existence of a common speaker into account, in contrast to the two-step approach. We further show that PCSI can be used to reduce the errors introduced in the diarization step of the two-step approach. Our experiments are performed on a corpus synthesised from the AMI corpus and also on an in-house conversational telephony Sichuanese corpus that is mixed with Mandarin. We show up to 7.68% relative improvement in time-weighted equal error rate over a state-of-the-art x-vector diarization and linking system.
Spoken Language Identification Using Bidirectional LSTM Based LID Sequential Senones
H. Muralikrishna, P. Sapra, Anuksha Jain, Dileep Aroor Dinesh
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003947
Abstract: The effectiveness of the features used to represent speech utterances influences the performance of spoken language identification (LID) systems. Recent LID systems use bottleneck features (BNFs) obtained from deep neural networks (DNNs) to represent the utterances. These BNFs do not encode language-specific features. Recent advances in DNNs have led to the use of effective language-sensitive features such as LID-senones, obtained using a convolutional neural network (CNN) based architecture. In this work, we propose a novel approach to obtain LID-senones. The proposed approach combines BNFs with bidirectional long short-term memory (BLSTM) networks to generate LID-senones. Since these LID-senones preserve sequence information, we term them LID-sequential-senones (LID-seq-senones). The proposed LID-seq-senones are then used for LID in two ways. In the first approach, we propose to build an end-to-end structure with a BLSTM as the front-end LID-seq-senone extractor followed by a fully connected classification layer. In the second approach, we consider each utterance as a sequence of LID-seq-senones and propose to use a support vector machine (SVM) with a sequence kernel (GMM-based segment-level pyramid match kernel) to classify the utterance. The effectiveness of the proposed representation is evaluated on the Oregon Graduate Institute multi-language telephone speech corpus (OGI-TS) and the IIT Madras Indian language corpus (IITM-IL).
{"title":"Joint Distribution Learning in the Framework of Variational Autoencoders for Far-Field Speech Enhancement","authors":"Mahesh K. Chelimilla, Shashi Kumar, S. Rath","doi":"10.1109/ASRU46091.2019.9004024","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004024","url":null,"abstract":"Far-field speech recognition is a challenging task as speech recognizers trained on close-talk speech do not generalize well to far-field speech. In order to handle such issues, neural network based speech enhancement is typically applied using denoising autoencoder (DA). Recently generative models have become more popular particularly in the field of image generation and translation. One of the popular techniques in this generative framework is variational autoencoder (VAE). In this paper we consider VAE for speech enhancement task in the context of automatic speech recognition (ASR). We propose a novel modification in the conventional VAE to model joint distribution of the far-field and close-talk features for a common latent space representation, which we refer to as joint-VAE. Unlike conventional VAE, joint-VAE involves one encoder network that projects the far-field features onto a latent space and two decoder networks that generate close-talk and far-field features separately. Experiments conducted on the AMI corpus show that it gives a relative WER improvement of 9% compared to conventional DA and a relative improvement of 19.2% compared to mismatched train and test scenario.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123951197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Batch Normalization Adaptation for Automatic Speech Recognition","authors":"F. Mana, F. Weninger, R. Gemello, P. Zhan","doi":"10.1109/ASRU46091.2019.9003883","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003883","url":null,"abstract":"Deep Neural Network (DNN) acoustic models are sensitive to the mismatch between training and testing environments. When a trained model is tested on unseen speakers, domain, or environment, recognition accuracy can degrade substantially. In such a case, offline adaptation with a fair amount of field data can improve recognition accuracy significantly, and is commonly applied to ASR systems in practice. Ideally, such kind of adaptation should be done online as well in order to catch any unexpected dynamic changes in the environments during the inference process. However, online adaptation is subject to strict constraints on computational cost. On the other hand, the small amount of available data and the nature of unsupervised adaptation make online adaptation a very challenging task, especially for DNN acoustic models which normally contain millions of parameters. In this paper, we introduce a simple and effective online adaptation technique to compensate training and testing mismatch for DNN acoustic models. It is done via online adaptation of the parameters associated with the batch normalization applied to the model training process. Our results show that this technique can improve accuracy significantly in a domain mismatched scenario for different DNN architectures.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124331738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Code-Switching ASR and TTS with Multilingual Machine Speech Chain
Sahoko Nakayama, Andros Tjandra, S. Sakti, Satoshi Nakamura
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003926
Abstract: Constructing automatic speech recognition (ASR) and text-to-speech (TTS) for code-switching in a supervised fashion poses a challenge, since a large amount of code-switching speech and the corresponding transcriptions are usually unavailable. The machine speech chain mechanism can be utilized to achieve semi-supervised learning. The framework enables ASR and TTS to assist each other when they receive unpaired data, since it allows them to infer the missing pair and optimize the models with a reconstruction loss. In this study, we handle multiple language pairs of code-switching by integrating language embeddings into the machine speech chain and investigate whether the model can perform with code-switching language pairs that are never explicitly seen during training. Experimental results reveal that the proposed approach improves the performance of the multilingual code-switching language pairs on which the model was trained and can also perform with unknown code-switching language pairs without directly training on them.
Paraphrase Generation Based on VAE and Pointer-Generator Networks
Lohith Ravuru, Hyungtak Choi, M. SiddarthK., Hojung Lee, Inchul Hwang
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019. DOI: 10.1109/ASRU46091.2019.9003874
Abstract: Paraphrase generation is a challenging task that involves expressing the meaning of a sentence using synonyms or different phrases, either to achieve variation or a certain stylistic response. Most previous sequence-to-sequence (Seq2Seq) models focus on either generating variations or preserving the content. We mainly address the issue of preserving the content of a sentence while generating diverse paraphrases. In this paper, we propose a novel approach for paraphrase generation using a variational autoencoder (VAE) and a Pointer-Generator Network (PGN). The proposed model uses a copy mechanism to control content transfer, a VAE to introduce variation, and a training technique that restricts the gradient flow for efficient learning. Our evaluations on the QUORA and MS COCO datasets show that our model outperforms state-of-the-art approaches and that the generated paraphrases are highly diverse as well as consistent with their original meaning.