2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Improving ASR Error Detection with RNNLM Adaptation
Authors: Rahhal Errattahi, S. Deena, A. Hannani, H. Ouahmane, Thomas Hain
Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639602
Abstract: Applications of automatic speech recognition (ASR), such as broadcast transcription and dialog systems, can benefit from the ability to detect errors in the ASR output. The field of ASR error detection has emerged as a way to detect and subsequently correct ASR errors. The most common approach is feature-based: a set of features is extracted from the ASR output and used to train a classifier to predict correct/incorrect labels. Language models (LMs), either from the ASR decoder or externally trained, can provide features to an ASR error detection system through scores computed on the ASR output. Recently, recurrent neural network language model (RNNLM) features were proposed for ASR error detection, improving the classification rate thanks to their ability to model longer-range context. RNNLM adaptation, through the introduction of auxiliary features that encode domain, has been shown to improve ASR performance. This work investigates whether RNNLM adaptation techniques can also improve ASR error detection performance in the context of multi-genre broadcast ASR. The results show that an overall improvement of about 1% in the F-measure can be achieved using adapted RNNLM features.
Citations: 7
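The feature-based pipeline described in this abstract (per-word features feeding a binary correct/incorrect classifier) can be sketched with toy data. The feature names and values below are hypothetical, and a minimal logistic regression stands in for whatever classifier a real system would use:

```python
import numpy as np

# Toy per-word features (hypothetical): [acoustic score, LM log-prob, duration in s].
# Label 1 = word correctly recognised, 0 = ASR error.
X = np.array([[0.9, -2.1, 0.30],
              [0.2, -7.5, 0.05],
              [0.8, -3.0, 0.25],
              [0.1, -8.0, 0.04]])
y = np.array([1, 0, 1, 0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimal logistic-regression error detector trained by gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):
    g = sigmoid(X @ w + b) - y          # gradient of the log-loss w.r.t. the logits
    w -= 0.5 * (X.T @ g) / len(y)
    b -= 0.5 * g.mean()

pred = (sigmoid(X @ w + b) > 0.5).astype(int)
```

In the paper's setting, RNNLM scores computed on the ASR output would be appended to this feature vector; the adaptation result above suggests domain-adapted RNNLM scores make those features more informative.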
Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions
Authors: Jia Cui, Chao Weng, Guangsen Wang, J. Wang, Peidong Wang, Chengzhu Yu, Dan Su, Dong Yu
Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639587
Abstract: Acoustic model and language model (LM) have been two major components in conventional speech recognition systems. They are normally trained independently, but recently there has been a trend to optimize both components simultaneously in a unified end-to-end (E2E) framework. However, the performance gap between E2E systems and traditional hybrid systems suggests that some knowledge has not yet been fully utilized in the new framework. An observation is that current attention-based E2E systems can produce better recognition results when decoded with LMs that are independently trained on the same resources. In this paper, we focus on how to improve attention-based E2E systems without increasing model complexity or resorting to extra data. A novel training strategy is proposed for multi-task training with the connectionist temporal classification (CTC) loss. The sequence-based minimum Bayes risk (MBR) loss is also investigated. Our experiments on the 300-hour SWB corpus showed that both loss functions significantly improve baseline model performance. The additional gain from joint-LM decoding remains the same for the CTC-trained model but is only marginal for the MBR-trained model. This implies that while the CTC loss function is able to capture more acoustic knowledge, the MBR loss function exploits more word/character dependency.
Citations: 10
Efficient Implementation of Recurrent Neural Network Transducer in Tensorflow
Authors: Tom Bagby, Kanishka Rao, K. Sim
Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639690
Abstract: Recurrent neural network transducer (RNN-T) has been successfully applied to automatic speech recognition to jointly learn the acoustic and language model components. The RNN-T loss and its gradient with respect to the softmax outputs can be computed efficiently using a forward-backward algorithm. In this paper, we present an efficient implementation of the RNN-T forward-backward and Viterbi algorithms using standard matrix operations. This allows us to easily implement the algorithm in TensorFlow by making use of the existing hardware-accelerated implementations of these operations. This work is based on a similar technique used in our previous work for computing the connectionist temporal classification and lattice-free maximum mutual information losses, where the forward and backward recursions are viewed as a bi-directional RNN whose states represent the forward and backward probabilities. Our benchmark results on graphic processing unit (GPU) and tensor processing unit (TPU) show that our implementation can achieve better throughput performance by increasing the batch size to maximize parallel computation. Furthermore, our implementation is about twice as fast on TPU compared to GPU for batch…
Citations: 31
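The RNN-T forward recursion this abstract refers to can be written as a small reference implementation. The paper's contribution is expressing the recursion as batched matrix operations, scanned like a bidirectional RNN so TensorFlow's accelerated kernels apply; the loop-based sketch below only shows the underlying recurrence, with toy probability tables as inputs:

```python
import numpy as np

def logsumexp2(a, b):
    # numerically stable log(exp(a) + exp(b))
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def rnnt_forward_loglik(log_blank, log_label):
    """RNN-T forward recursion over the T x (U+1) lattice.
    log_blank[t, u]: log-prob of emitting blank at node (t, u), advancing in time.
    log_label[t, u]: log-prob of emitting label u+1 at node (t, u), advancing in label.
    Returns log P(label sequence | acoustics)."""
    T, U1 = log_blank.shape
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t > 0:   # arrive by consuming a frame with a blank
                alpha[t, u] = logsumexp2(alpha[t, u],
                                         alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:   # arrive by emitting the next label
                alpha[t, u] = logsumexp2(alpha[t, u],
                                         alpha[t, u - 1] + log_label[t, u - 1])
    # a final blank leaves the last lattice node
    return alpha[T - 1, U1 - 1] + log_blank[T - 1, U1 - 1]
```

In the vectorized form the paper describes, each anti-diagonal of `alpha` is computed in one matrix operation from the previous diagonal, which is what lets batch size drive throughput on GPU/TPU.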
Context-Aware Dialog Re-Ranking for Task-Oriented Dialog Systems
Authors: Junki Ohmura, M. Eskénazi
Pub Date: 2018-11-28 | DOI: 10.1109/SLT.2018.8639596
Abstract: Dialog response ranking is used to rank response candidates by considering their relation to the dialog history. Although researchers have addressed this concept for open-domain dialogs, little attention has been focused on task-oriented dialogs. Furthermore, no previous studies have analyzed whether response ranking can improve the performance of existing dialog systems in real human–computer dialogs with speech recognition errors. In this paper, we propose a context-aware dialog response re-ranking system. Our system re-ranks responses in two steps: (1) it calculates matching scores for each candidate response and the current dialog context; (2) it combines the matching scores and a probability distribution of the candidates from an existing dialog system for response re-ranking. By using neural word embedding-based models and handcrafted or logistic regression-based ensemble models, we have improved the performance of a recently proposed end-to-end task-oriented dialog system on real dialogs with speech recognition errors.
Citations: 5
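Step (2) of the re-ranking procedure, combining matching scores with the base system's candidate distribution, can be sketched as a linear interpolation. The cosine-similarity matching score and the interpolation weight below are illustrative assumptions, not the paper's exact ensemble models:

```python
import numpy as np

def rerank(candidates, context_vec, cand_vecs, system_probs, lam=0.7):
    """Re-rank response candidates: interpolate a context-match score
    (cosine similarity of context and candidate embeddings, mapped to [0, 1])
    with the base dialog system's probability for each candidate."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    match = np.array([(cos(context_vec, v) + 1.0) / 2.0 for v in cand_vecs])
    scores = lam * match + (1.0 - lam) * np.asarray(system_probs)
    return [candidates[i] for i in np.argsort(-scores)]  # best first
```

With this shape, a context-matched candidate can overtake a response the base system preferred, which is exactly the recovery behaviour that helps when speech recognition errors mislead the base system.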
Comprehensive Evaluation of Statistical Speech Waveform Synthesis
Authors: Thomas Merritt, Bartosz Putrycz, Adam Nadolski, Tianjun Ye, Daniel Korzekwa, Wiktor Dolecki, Thomas Drugman, V. Klimkov, A. Moinet, A. Breen, Rafal Kuklinski, N. Strom, R. Barra-Chicote
Pub Date: 2018-11-15 | DOI: 10.1109/SLT.2018.8639556
Abstract: Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of speech errors of SSWS compared to hybrid unit selection synthesis is conducted to identify the strengths and weaknesses of SSWS. Having a deeper insight into SSWS allows us to better define the focus of future work to improve this new technology.
Citations: 17
Analyzing Deep CNN-Based Utterance Embeddings for Acoustic Model Adaptation
Authors: Joanna Rownicka, P. Bell, S. Renals
Pub Date: 2018-11-12 | DOI: 10.1109/SLT.2018.8639036
Abstract: We explore why deep convolutional neural networks (CNNs) with small two-dimensional kernels, primarily used for modeling spatial relations in images, are also effective in speech recognition. We analyze the representations learned by deep CNNs and compare them with deep neural network (DNN) representations and i-vectors, in the context of acoustic model adaptation. To explore whether interpretable information can be decoded from the learned representations, we evaluate their ability to discriminate between speakers, acoustic conditions, noise type, and gender using the Aurora-4 dataset. We extract both whole-model embeddings (to capture the information learned across the whole network) and layer-specific embeddings which enable understanding of the flow of information across the network. We also use learned representations as additional input for a time-delay neural network (TDNN) for the Aurora-4 and MGB-3 English datasets. We find that deep CNN embeddings outperform DNN embeddings for acoustic model adaptation, and auxiliary features based on deep CNN embeddings result in similar word error rates to i-vectors.
Citations: 11
User Modeling for Task Oriented Dialogues
Authors: Izzeddin Gur, Dilek Z. Hakkani-Tür, Gökhan Tür, Pararth Shah
Pub Date: 2018-11-11 | DOI: 10.1109/SLT.2018.8639652
Abstract: We introduce end-to-end neural network based models for simulating users of task-oriented dialogue systems. User simulation in dialogue systems is crucial from two different perspectives: (i) automatic evaluation of different dialogue models, and (ii) training task-oriented dialogue systems. We design a hierarchical sequence-to-sequence model that first encodes the initial user goal and system turns into fixed-length representations using recurrent neural networks (RNNs). It then encodes the dialogue history using another RNN layer. At each turn, user responses are decoded from the hidden representations of the dialogue-level RNN. This hierarchical user simulator (HUS) approach allows the model to capture undiscovered parts of the user goal without the need for explicit dialogue state tracking. We further develop several variants: a latent variable model injects random variations into user responses to promote diversity, and a novel goal regularization mechanism penalizes divergence of user responses from the initial user goal. We evaluate the proposed models on the movie ticket booking domain by systematically interacting each user simulator with various dialogue system policies trained with different objectives and users.
Citations: 44
Towards Fluent Translations From Disfluent Speech
Authors: Elizabeth Salesky, Susanne Burger, J. Niehues, A. Waibel
Pub Date: 2018-11-07 | DOI: 10.1109/SLT.2018.8639661
Abstract: When translating from speech, special consideration for conversational speech phenomena such as disfluencies is necessary. Most machine translation training data consists of well-formed written texts, causing issues when translating spontaneous speech. Previous work has introduced an intermediate step between speech recognition (ASR) and machine translation (MT) to remove disfluencies, making the data better-matched to typical translation text and significantly improving performance. However, with the rise of end-to-end speech translation systems, this intermediate step must be incorporated into the sequence-to-sequence architecture. Further, though translated speech datasets exist, they are typically news or rehearsed speech without many disfluencies (e.g. TED), or the disfluencies are translated into the references (e.g. Fisher). To generate clean translations from disfluent speech, cleaned references are necessary for evaluation. We introduce a corpus of cleaned target data for the Fisher Spanish-English dataset for this task. We compare how different architectures handle disfluencies and provide a baseline for removing disfluencies in end-to-end translation.
Citations: 22
Confidence Estimation and Deletion Prediction Using Bidirectional Recurrent Neural Networks
Authors: A. Ragni, Qiujia Li, M. Gales, Yu Wang
Pub Date: 2018-10-30 | DOI: 10.1109/SLT.2018.8639678
Abstract: The standard approach to assess reliability of automatic speech transcriptions is through the use of confidence scores. If accurate, these scores provide a flexible mechanism to flag transcription errors for upstream and downstream applications. One challenging type of errors that recognisers make are deletions. These errors are not accounted for by the standard confidence estimation schemes and are hard to rectify in the upstream and downstream processing. High deletion rates are prominent in limited-resource and highly mismatched training/testing conditions studied under the IARPA Babel and Material programs. This paper looks at the use of bidirectional recurrent neural networks to yield confidence estimates in predicted as well as deleted words. Several simple schemes are examined for combination. To assess the usefulness of this approach, the combined confidence score is examined for untranscribed data selection that favours transcriptions with lower deletion errors. Experiments are conducted using IARPA Babel/Material program languages.
Citations: 30
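The data-selection use case in this abstract, favouring transcriptions with lower deletion errors, can be sketched by combining per-word confidences with predicted deletion probabilities into one utterance-level score. The geometric-mean combination below is an illustrative assumption; the paper examines several combination schemes without prescribing this one:

```python
import numpy as np

def utterance_score(word_confs, deletion_probs):
    """Geometric mean of per-word confidences and per-gap 'no deletion'
    probabilities: one reliability score per automatic transcription."""
    terms = np.array(list(word_confs) + [1.0 - d for d in deletion_probs])
    return float(np.exp(np.log(terms).mean()))

def select_untranscribed(utts, k):
    """utts: list of (utt_id, word_confs, deletion_probs) tuples.
    Keep the top-k utterances, favouring high word confidence and
    few predicted deletions."""
    ranked = sorted(utts, key=lambda u: utterance_score(u[1], u[2]), reverse=True)
    return [u[0] for u in ranked[:k]]
```

The point of folding deletion predictions into the score is that two transcriptions with identical word confidences can still differ sharply in how much content the recogniser silently dropped.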
American Sign Language Fingerspelling Recognition in the Wild
Authors: Bowen Shi, Aurora Martinez Del Rio, J. Keane, Jonathan Michaux, D. Brentari, Gregory Shakhnarovich, Karen Livescu
Pub Date: 2018-10-26 | DOI: 10.1109/SLT.2018.8639639
Abstract: We address the problem of American Sign Language fingerspelling recognition "in the wild", using videos collected from websites. We introduce the largest data set available so far for the problem of fingerspelling recognition, and the first using naturally occurring video data. Using this data set, we present the first attempt to recognize fingerspelling sequences in this challenging setting. Unlike prior work, our video data is extremely challenging due to low frame rates and visual variability. To tackle the visual challenges, we train a special-purpose signing hand detector using a small subset of our data. Given the hand detector output, a sequence model decodes the hypothesized fingerspelled letter sequence. For the sequence model, we explore attention-based recurrent encoder-decoders and CTC-based approaches. As the first attempt at fingerspelling recognition in the wild, this work is intended to serve as a baseline for future work on sign language recognition in realistic conditions. We find that, as expected, letter error rates are much higher than in previous work on more controlled data, and we analyze the sources of error and effects of model variants.
Citations: 47