{"title":"Language modeling with highway LSTM","authors":"Gakuto Kurata, B. Ramabhadran, G. Saon, A. Sethy","doi":"10.1109/ASRU.2017.8268942","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268942","url":null,"abstract":"Language models (LMs) based on Long Short Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks. In this paper, we extend an LSTM by adding highway networks inside it and use the resulting Highway LSTM (HW-LSTM) model for language modeling. The added highway networks increase the depth in the time dimension. Since a typical LSTM has two internal states, a memory cell and a hidden state, we compare various types of HW-LSTM by adding highway networks onto the memory cell and/or the hidden state. Experimental results on English broadcast news and conversational telephone speech recognition show that the proposed HW-LSTM LM improves speech recognition accuracy on top of a strong LSTM LM baseline. We report 5.1% and 9.9% WER on the Switchboard and CallHome subsets of the Hub5 2000 evaluation, which are the best performance numbers reported on these tasks to date.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126473572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative policy learning in end-to-end trainable task-oriented neural dialog models","authors":"Bing Liu, Ian Lane","doi":"10.1109/ASRU.2017.8268975","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268975","url":null,"abstract":"In this paper, we present a deep reinforcement learning (RL) framework for iterative dialog policy optimization in end-to-end task-oriented dialog systems. Popular approaches to learning dialog policy with RL include letting a dialog agent learn against a user simulator. Building a reliable user simulator, however, is not trivial, often as difficult as building a good dialog agent. We address this challenge by jointly optimizing the dialog agent and the user simulator with deep RL by simulating dialogs between the two agents. We first bootstrap a basic dialog agent and a basic user simulator by learning directly from dialog corpora with supervised training. We then improve them further by letting the two agents conduct task-oriented dialogs and iteratively optimizing their policies with deep RL. Both the dialog agent and the user simulator are designed with neural network models that can be trained end-to-end. Our experimental results show that the proposed method leads to promising improvements in task success rate and total task reward compared to supervised training and single-agent RL training baselines.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129826060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unwritten languages demand attention too! Word discovery with encoder-decoder models","authors":"Marcely Zanon Boito, Alexandre Berard, Aline Villavicencio, L. Besacier","doi":"10.1109/ASRU.2017.8268972","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268972","url":null,"abstract":"Word discovery is the task of extracting words from unsegmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision and another with limited supervision with access to the most frequent words. The results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system with only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to work directly from speech.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124715346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Character-based units for unlimited vocabulary continuous speech recognition","authors":"Peter Smit, Siva Charan Reddy Gangireddy, Seppo Enarvi, Sami Virpioja, M. Kurimo","doi":"10.1109/ASRU.2017.8268929","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268929","url":null,"abstract":"We study character-based language models in the state-of-the-art speech recognition framework. This approach has advantages over both word-based systems and so-called end-to-end ASR systems that do not have separate acoustic and language models. We describe the modifications needed to build an effective character-based ASR system using the Kaldi toolkit and evaluate models based on words, statistical morphs, and characters for both Finnish and Arabic. The morph-based models yield the best recognition results for both well-resourced and lower-resourced tasks, but the character-based models come close to their performance in the lower-resourced tasks, outperforming the word-based models. Character-based models are especially good at predicting novel word forms that were not seen in the training data. Character-based neural network language models are both computationally efficient and provide a larger gain than the morph- and word-based systems.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"92 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121859986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrated speaker-adaptive speech synthesis","authors":"Moquan Wan, G. Degottex, M. Gales","doi":"10.1109/ASRU.2017.8269006","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269006","url":null,"abstract":"Enabling speech synthesis systems to rapidly adapt to sound like a particular speaker is an essential attribute for building personalised systems. For deep-learning based approaches, this is difficult as these networks use a highly distributed representation. It is not simple to interpret the model parameters, which complicates the adaptation process. To address this problem, speaker characteristics can be encapsulated in fixed-length speaker-specific Identity Vectors (iVectors), which are appended to the input of the synthesis network. Altering the iVector changes the nature of the synthesised speech. The challenge is to derive an optimal iVector for each speaker that encodes all the speaker attributes required for the synthesis system. The standard approach involves two separate stages: estimation of the iVectors for the training data; and training the synthesis network. This paper proposes an integrated training scheme for speaker adaptive speech synthesis. For the iVector extraction, an attention based mechanism, which is a function of the context labels, is used to combine the data from the target speaker. This attention mechanism, as well as the nature of the features being merged, is optimised at the same time as the synthesis network parameters. This should yield an iVector-like speaker representation that is optimal for use with the synthesis system. The system is evaluated on the Voice Bank corpus. The resulting system automatically produces a sensible attention sequence and shows improved performance over the standard approach.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130223110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MIT-QCRI Arabic dialect identification system for the 2017 multi-genre broadcast challenge","authors":"Suwon Shon, Ahmed M. Ali, James R. Glass","doi":"10.1109/ASRU.2017.8268960","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268960","url":null,"abstract":"In order to successfully annotate the Arabic speech content found in open-domain media broadcasts, it is essential to be able to process a diverse set of Arabic dialects. For the 2017 Multi-Genre Broadcast challenge (MGB-3) there were two possible tasks: Arabic speech recognition, and Arabic Dialect Identification (ADI). In this paper, we describe our efforts to create an ADI system for the MGB-3 challenge, with the goal of distinguishing amongst four major Arabic dialects, as well as Modern Standard Arabic. Our research focused on dialect variability and domain mismatches between the training and test domains. To achieve a robust ADI system, we explored both Siamese neural network models, to learn similarities and dissimilarities among Arabic dialects, and i-vector post-processing, to adapt to domain mismatches. Both acoustic and linguistic features were used for the final MGB-3 submissions, with the best primary system achieving 75% accuracy on the official 10hr test set.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116101178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring neural transducers for end-to-end speech recognition","authors":"Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, S. Satheesh, Anuroop Sriram, Zhenyao Zhu","doi":"10.1109/ASRU.2017.8268937","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268937","url":null,"abstract":"In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model, on the popular Hub5'00 benchmark. On our internal diverse dataset, these trends continue — RNN-Transducer models rescored with a language model after beam search outperform our best CTC models. These results simplify the speech recognition pipeline so that decoding can now be expressed purely as neural network operations. We also study how the choice of encoder architecture affects the performance of the three models — when all encoder layers are forward only, and when encoders downsample the input representation aggressively.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131573795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language modeling with neural trans-dimensional random fields","authors":"Bin Wang, Zhijian Ou","doi":"10.1109/ASRU.2017.8268949","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268949","url":null,"abstract":"Trans-dimensional random field language models (TRF LMs) have recently been introduced, where sentences are modeled as a collection of random fields. The TRF approach has been shown to have the advantages of being computationally more efficient in inference than LSTM LMs with close performance and being able to flexibly integrate rich features. In this paper we propose neural TRFs, going beyond the previous discrete TRFs that use only linear potentials with discrete features. The idea is to use nonlinear potentials with continuous features, implemented by neural networks (NNs), in the TRF framework. Neural TRFs combine the advantages of both NNs and TRFs. The benefits of word embedding, nonlinear feature learning and larger context modeling are inherited from the use of NNs. At the same time, the strength of efficient inference by avoiding the expensive softmax is preserved. A number of technical contributions, including employing deep convolutional neural networks (CNNs) to define the potentials and incorporating the joint stochastic approximation (JSA) strategy in the training algorithm, are developed in this work, which enable us to successfully train neural TRF LMs. Various LMs are evaluated in terms of speech recognition WERs by rescoring the 1000-best lists of WSJ'92 test data. The results show that neural TRF LMs not only improve over discrete TRF LMs, but also perform slightly better than LSTM LMs with only one fifth of the parameters and 16x faster inference.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115263405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation","authors":"Wei-Ning Hsu, Yu Zhang, James R. Glass","doi":"10.1109/ASRU.2017.8268911","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268911","url":null,"abstract":"Domain mismatch between training and testing can lead to significant degradation in performance in many machine learning scenarios. Unfortunately, this is not a rare situation for automatic speech recognition deployments in real-world applications. Research on robust speech recognition can be regarded as trying to overcome this domain mismatch issue. In this paper, we address the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are available, but word transcripts are only available for the source domain speech. We present novel augmentation-based methods that transform speech in a way that does not change the transcripts. Specifically, we first train a variational autoencoder on both source and target domain data (without supervision) to learn a latent representation of speech. We then transform nuisance attributes of speech that are irrelevant to recognition by modifying the latent representations, in order to augment labeled training data with additional data whose distribution is more similar to the target domain. The proposed method is evaluated on the CHiME-4 dataset and reduces the absolute word error rate (WER) by as much as 35% compared to the non-adapted baseline.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124487374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Listening while speaking: Speech chain by deep learning","authors":"Andros Tjandra, S. Sakti, Satoshi Nakamura","doi":"10.1109/ASRU.2017.8268950","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268950","url":null,"abstract":"Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently without exerting much mutual influence on each other. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop speech chain model based on deep learning. The sequence-to-sequence model in a closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data. While ASR transcribes the unlabeled speech features, TTS attempts to reconstruct the original speech waveform based on the text from ASR. In the opposite direction, ASR also attempts to reconstruct the original text transcription given the synthesized speech. To the best of our knowledge, this is the first deep learning model that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improves performance over separate systems trained only on labeled data.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122828362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}