2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Improving ASR Error Detection with RNNLM Adaptation
Authors: Rahhal Errattahi, S. Deena, A. Hannani, H. Ouahmane, Thomas Hain
Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639602
Abstract: Applications of automatic speech recognition (ASR), such as broadcast transcription and dialog systems, can benefit from the ability to detect errors in the ASR output. The field of ASR error detection has emerged as a way to detect and subsequently correct ASR errors. The most common approach is feature-based: a set of features is extracted from the ASR output and used to train a classifier to predict correct/incorrect labels. Language models (LMs), either from the ASR decoder or externally trained, can provide features to an ASR error detection system through scores computed on the ASR output. Recently, recurrent neural network language model (RNNLM) features were proposed for ASR error detection, improving the classification rate thanks to their ability to model longer-range context. RNNLM adaptation, through the introduction of auxiliary features that encode domain, has been shown to improve ASR performance. This work investigates whether RNNLM adaptation techniques can also improve ASR error detection performance in the context of multi-genre broadcast ASR. The results show that an overall improvement of about 1% in the F-measure can be achieved using adapted RNNLM features.
Citations: 7
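The feature-based pipeline described in this abstract (per-word features feeding a binary correct/incorrect classifier) can be sketched with toy data. The feature names and values below are hypothetical, and a minimal logistic regression stands in for whatever classifier a real system would use:

```python
import numpy as np

# Toy per-word features (hypothetical): [acoustic score, LM log-prob, duration in s].
# Label 1 = word correctly recognised, 0 = ASR error.
X = np.array([[0.9, -2.1, 0.30],
              [0.2, -7.5, 0.05],
              [0.8, -3.0, 0.25],
              [0.1, -8.0, 0.04]])
y = np.array([1, 0, 1, 0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimal logistic-regression error detector trained by gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):
    g = sigmoid(X @ w + b) - y          # gradient of the log-loss w.r.t. the logits
    w -= 0.5 * (X.T @ g) / len(y)
    b -= 0.5 * g.mean()

pred = (sigmoid(X @ w + b) > 0.5).astype(int)
```

In the paper's setting, RNNLM scores computed on the ASR output would be appended to this feature vector; the adaptation result above suggests domain-adapted RNNLM scores make those features more informative.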
Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions
Authors: Jia Cui, Chao Weng, Guangsen Wang, J. Wang, Peidong Wang, Chengzhu Yu, Dan Su, Dong Yu
Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639587
Abstract: Acoustic model and language model (LM) have been two major components in conventional speech recognition systems. They are normally trained independently, but recently there has been a trend to optimize both components simultaneously in a unified end-to-end (E2E) framework. However, the performance gap between E2E systems and traditional hybrid systems suggests that some knowledge has not yet been fully utilized in the new framework. An observation is that current attention-based E2E systems can produce better recognition results when decoded with LMs that are independently trained on the same resources. In this paper, we focus on how to improve attention-based E2E systems without increasing model complexity or resorting to extra data. A novel training strategy is proposed for multi-task training with the connectionist temporal classification (CTC) loss. The sequence-based minimum Bayes risk (MBR) loss is also investigated. Our experiments on the 300-hour SWB corpus showed that both loss functions significantly improve baseline model performance. The additional gain from joint-LM decoding remains the same for the CTC-trained model but is only marginal for the MBR-trained model. This implies that while the CTC loss function is able to capture more acoustic knowledge, the MBR loss function exploits more word/character dependency.
Citations: 10
Efficient Implementation of Recurrent Neural Network Transducer in Tensorflow
Authors: Tom Bagby, Kanishka Rao, K. Sim
Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639690
Abstract: Recurrent neural network transducer (RNN-T) has been successfully applied to automatic speech recognition to jointly learn the acoustic and language model components. The RNN-T loss and its gradient with respect to the softmax outputs can be computed efficiently using a forward-backward algorithm. In this paper, we present an efficient implementation of the RNN-T forward-backward and Viterbi algorithms using standard matrix operations. This allows us to easily implement the algorithm in TensorFlow by making use of the existing hardware-accelerated implementations of these operations. This work is based on a similar technique used in our previous work for computing the connectionist temporal classification and lattice-free maximum mutual information losses, where the forward and backward recursions are viewed as a bi-directional RNN whose states represent the forward and backward probabilities. Our benchmark results on graphic processing unit (GPU) and tensor processing unit (TPU) show that our implementation can achieve better throughput performance by increasing the batch size to maximize parallel computation. Furthermore, our implementation is about twice as fast on TPU compared to GPU for batch…
Citations: 31
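The RNN-T forward recursion this abstract refers to can be written as a small reference implementation. The paper's contribution is expressing the recursion as batched matrix operations, scanned like a bidirectional RNN so TensorFlow's accelerated kernels apply; the loop-based sketch below only shows the underlying recurrence, with toy probability tables as inputs:

```python
import numpy as np

def logsumexp2(a, b):
    # numerically stable log(exp(a) + exp(b))
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def rnnt_forward_loglik(log_blank, log_label):
    """RNN-T forward recursion over the T x (U+1) lattice.
    log_blank[t, u]: log-prob of emitting blank at node (t, u), advancing in time.
    log_label[t, u]: log-prob of emitting label u+1 at node (t, u), advancing in label.
    Returns log P(label sequence | acoustics)."""
    T, U1 = log_blank.shape
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t > 0:   # arrive by consuming a frame with a blank
                alpha[t, u] = logsumexp2(alpha[t, u],
                                         alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:   # arrive by emitting the next label
                alpha[t, u] = logsumexp2(alpha[t, u],
                                         alpha[t, u - 1] + log_label[t, u - 1])
    # a final blank leaves the last lattice node
    return alpha[T - 1, U1 - 1] + log_blank[T - 1, U1 - 1]
```

In the vectorized form the paper describes, each anti-diagonal of `alpha` is computed in one matrix operation from the previous diagonal, which is what lets batch size drive throughput on GPU/TPU.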
Context-Aware Dialog Re-Ranking for Task-Oriented Dialog Systems
Authors: Junki Ohmura, M. Eskénazi
Pub Date: 2018-11-28 | DOI: 10.1109/SLT.2018.8639596
Abstract: Dialog response ranking is used to rank response candidates by considering their relation to the dialog history. Although researchers have addressed this concept for open-domain dialogs, little attention has been focused on task-oriented dialogs. Furthermore, no previous studies have analyzed whether response ranking can improve the performance of existing dialog systems in real human–computer dialogs with speech recognition errors. In this paper, we propose a context-aware dialog response re-ranking system. Our system re-ranks responses in two steps: (1) it calculates matching scores for each candidate response and the current dialog context; (2) it combines the matching scores and a probability distribution of the candidates from an existing dialog system for response re-ranking. By using neural word embedding-based models and handcrafted or logistic regression-based ensemble models, we have improved the performance of a recently proposed end-to-end task-oriented dialog system on real dialogs with speech recognition errors.
Citations: 5
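Step (2) of the re-ranking procedure, combining matching scores with the base system's candidate distribution, can be sketched as a linear interpolation. The cosine-similarity matching score and the interpolation weight below are illustrative assumptions, not the paper's exact ensemble models:

```python
import numpy as np

def rerank(candidates, context_vec, cand_vecs, system_probs, lam=0.7):
    """Re-rank response candidates: interpolate a context-match score
    (cosine similarity of context and candidate embeddings, mapped to [0, 1])
    with the base dialog system's probability for each candidate."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    match = np.array([(cos(context_vec, v) + 1.0) / 2.0 for v in cand_vecs])
    scores = lam * match + (1.0 - lam) * np.asarray(system_probs)
    return [candidates[i] for i in np.argsort(-scores)]  # best first
```

With this shape, a context-matched candidate can overtake a response the base system preferred, which is exactly the recovery behaviour that helps when speech recognition errors mislead the base system.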
Comprehensive Evaluation of Statistical Speech Waveform Synthesis
Authors: Thomas Merritt, Bartosz Putrycz, Adam Nadolski, Tianjun Ye, Daniel Korzekwa, Wiktor Dolecki, Thomas Drugman, V. Klimkov, A. Moinet, A. Breen, Rafal Kuklinski, N. Strom, R. Barra-Chicote
Pub Date: 2018-11-15 | DOI: 10.1109/SLT.2018.8639556
Abstract: Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of speech errors of SSWS compared to hybrid unit selection synthesis is conducted to identify the strengths and weaknesses of SSWS. Having a deeper insight into SSWS allows us to better define the focus of future work to improve this new technology.
Citations: 17
Analyzing Deep CNN-Based Utterance Embeddings for Acoustic Model Adaptation
Authors: Joanna Rownicka, P. Bell, S. Renals
Pub Date: 2018-11-12 | DOI: 10.1109/SLT.2018.8639036
Abstract: We explore why deep convolutional neural networks (CNNs) with small two-dimensional kernels, primarily used for modeling spatial relations in images, are also effective in speech recognition. We analyze the representations learned by deep CNNs and compare them with deep neural network (DNN) representations and i-vectors, in the context of acoustic model adaptation. To explore whether interpretable information can be decoded from the learned representations, we evaluate their ability to discriminate between speakers, acoustic conditions, noise type, and gender using the Aurora-4 dataset. We extract both whole-model embeddings (to capture the information learned across the whole network) and layer-specific embeddings which enable understanding of the flow of information across the network. We also use learned representations as additional input for a time-delay neural network (TDNN) for the Aurora-4 and MGB-3 English datasets. We find that deep CNN embeddings outperform DNN embeddings for acoustic model adaptation, and auxiliary features based on deep CNN embeddings result in similar word error rates to i-vectors.
Citations: 11
User Modeling for Task Oriented Dialogues
Authors: Izzeddin Gur, Dilek Z. Hakkani-Tür, Gökhan Tür, Pararth Shah
Pub Date: 2018-11-11 | DOI: 10.1109/SLT.2018.8639652
Abstract: We introduce end-to-end neural network based models for simulating users of task-oriented dialogue systems. User simulation in dialogue systems is crucial from two different perspectives: (i) automatic evaluation of different dialogue models, and (ii) training task-oriented dialogue systems. We design a hierarchical sequence-to-sequence model that first encodes the initial user goal and system turns into fixed-length representations using recurrent neural networks (RNNs). It then encodes the dialogue history using another RNN layer. At each turn, user responses are decoded from the hidden representations of the dialogue-level RNN. This hierarchical user simulator (HUS) approach allows the model to capture undiscovered parts of the user goal without the need for explicit dialogue state tracking. We further develop several variants: a latent variable model injects random variations into user responses to promote diversity, and a novel goal regularization mechanism penalizes divergence of user responses from the initial user goal. We evaluate the proposed models on the movie ticket booking domain by systematically interacting each user simulator with various dialogue system policies trained with different objectives and users.
Citations: 44
Towards Fluent Translations From Disfluent Speech
Authors: Elizabeth Salesky, Susanne Burger, J. Niehues, A. Waibel
Pub Date: 2018-11-07 | DOI: 10.1109/SLT.2018.8639661
Abstract: When translating from speech, special consideration for conversational speech phenomena such as disfluencies is necessary. Most machine translation training data consists of well-formed written texts, causing issues when translating spontaneous speech. Previous work has introduced an intermediate step between speech recognition (ASR) and machine translation (MT) to remove disfluencies, making the data better-matched to typical translation text and significantly improving performance. However, with the rise of end-to-end speech translation systems, this intermediate step must be incorporated into the sequence-to-sequence architecture. Further, though translated speech datasets exist, they are typically news or rehearsed speech without many disfluencies (e.g. TED), or the disfluencies are translated into the references (e.g. Fisher). To generate clean translations from disfluent speech, cleaned references are necessary for evaluation. We introduce a corpus of cleaned target data for the Fisher Spanish-English dataset for this task. We compare how different architectures handle disfluencies and provide a baseline for removing disfluencies in end-to-end translation.
Citations: 22
Confidence Estimation and Deletion Prediction Using Bidirectional Recurrent Neural Networks
Authors: A. Ragni, Qiujia Li, M. Gales, Yu Wang
Pub Date: 2018-10-30 | DOI: 10.1109/SLT.2018.8639678
Abstract: The standard approach to assess reliability of automatic speech transcriptions is through the use of confidence scores. If accurate, these scores provide a flexible mechanism to flag transcription errors for upstream and downstream applications. One challenging type of errors that recognisers make are deletions. These errors are not accounted for by the standard confidence estimation schemes and are hard to rectify in the upstream and downstream processing. High deletion rates are prominent in limited-resource and highly mismatched training/testing conditions studied under the IARPA Babel and Material programs. This paper looks at the use of bidirectional recurrent neural networks to yield confidence estimates in predicted as well as deleted words. Several simple schemes are examined for combination. To assess the usefulness of this approach, the combined confidence score is examined for untranscribed data selection that favours transcriptions with lower deletion errors. Experiments are conducted using IARPA Babel/Material program languages.
Citations: 30
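The data-selection use case in this abstract, favouring transcriptions with lower deletion errors, can be sketched by combining per-word confidences with predicted deletion probabilities into one utterance-level score. The geometric-mean combination below is an illustrative assumption; the paper examines several combination schemes without prescribing this one:

```python
import numpy as np

def utterance_score(word_confs, deletion_probs):
    """Geometric mean of per-word confidences and per-gap 'no deletion'
    probabilities: one reliability score per automatic transcription."""
    terms = np.array(list(word_confs) + [1.0 - d for d in deletion_probs])
    return float(np.exp(np.log(terms).mean()))

def select_untranscribed(utts, k):
    """utts: list of (utt_id, word_confs, deletion_probs) tuples.
    Keep the top-k utterances, favouring high word confidence and
    few predicted deletions."""
    ranked = sorted(utts, key=lambda u: utterance_score(u[1], u[2]), reverse=True)
    return [u[0] for u in ranked[:k]]
```

The point of folding deletion predictions into the score is that two transcriptions with identical word confidences can still differ sharply in how much content the recogniser silently dropped.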
American Sign Language Fingerspelling Recognition in the Wild
Authors: Bowen Shi, Aurora Martinez Del Rio, J. Keane, Jonathan Michaux, D. Brentari, Gregory Shakhnarovich, Karen Livescu
Pub Date: 2018-10-26 | DOI: 10.1109/SLT.2018.8639639
Abstract: We address the problem of American Sign Language fingerspelling recognition "in the wild", using videos collected from websites. We introduce the largest data set available so far for the problem of fingerspelling recognition, and the first using naturally occurring video data. Using this data set, we present the first attempt to recognize fingerspelling sequences in this challenging setting. Unlike prior work, our video data is extremely challenging due to low frame rates and visual variability. To tackle the visual challenges, we train a special-purpose signing hand detector using a small subset of our data. Given the hand detector output, a sequence model decodes the hypothesized fingerspelled letter sequence. For the sequence model, we explore attention-based recurrent encoder-decoders and CTC-based approaches. As the first attempt at fingerspelling recognition in the wild, this work is intended to serve as a baseline for future work on sign language recognition in realistic conditions. We find that, as expected, letter error rates are much higher than in previous work on more controlled data, and we analyze the sources of error and effects of model variants.
Citations: 47