2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639699
Jesse Emond, B. Ramabhadran, Brian Roark, P. Moreno, Min Ma
{"title":"Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance","authors":"Jesse Emond, B. Ramabhadran, Brian Roark, P. Moreno, Min Ma","doi":"10.1109/SLT.2018.8639699","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639699","url":null,"abstract":"Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER) that smoothes out many of these irregularities by mapping all text to one writing system and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"286 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130825756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
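The abstract describes toWER only at a high level. As a rough illustration of the underlying idea, the sketch below maps code-switched reference and hypothesis text into a single writing system before computing a standard Levenshtein WER; the word-level transliteration table is a toy stand-in (a real system would use a full transliteration model, not a lookup):

```python
# Toy sketch of a transliteration-normalized WER (toWER-style metric).

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

# Hypothetical Devanagari -> Latin mapping, for illustration only.
TRANSLIT = {"नमस्ते": "namaste", "इंडिया": "india"}

def normalize(words):
    return [TRANSLIT.get(w, w).lower() for w in words]

def to_wer(ref, hyp):
    ref_n, hyp_n = normalize(ref.split()), normalize(hyp.split())
    return edit_distance(ref_n, hyp_n) / max(len(ref_n), 1)

print(to_wer("नमस्ते india", "namaste इंडिया"))  # 0.0 after normalization
```

With both strings normalized to one script, the writing-system mismatch no longer counts as an error, which is exactly the artificial WER inflation the metric is designed to remove.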
A New TIMIT Benchmark for Context-Independent Phone Recognition Using Turbo Fusion
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639670
Timo Lohrenz, Wei Li, T. Fingscheidt
{"title":"A New Timit Benchmark for Context-Independent Phone Recognition Using Turbo Fusion","authors":"Timo Lohrenz, Wei Li, T. Fingscheidt","doi":"10.1109/SLT.2018.8639670","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639670","url":null,"abstract":"In this work, we apply the recently proposed turbo fusion in conjunction with state-of-the-art convolutional neural networks as acoustic models to the standard phone recognition task on the TIMIT database. The turbo fusion operates on posterior streams stemming from standard filterbank features and from group delay (phase) features. By the iterative exchange of posterior information, the phone error rate is decreased down to 16.91% absolute, which is to our knowledge the best reported result on the TIMIT core test set so far using context-independent acoustic models, outperforming the previous respective benchmark by 4.4% relative.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130740861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
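The abstract does not give the fusion equations. Very loosely, turbo fusion exchanges posterior information between two streams over several iterations, in analogy to turbo decoding. The numpy sketch below is a simplified, invented reading of that idea (the exchange rule and the weight are illustrative assumptions, not the paper's formulation):

```python
import numpy as np

# Invented toy reading of turbo fusion: two posterior streams repeatedly
# absorb each other's current estimate (the "exchange of posterior
# information"), and the final estimates are combined and renormalized.
def turbo_fuse(post_a, post_b, iters=3, w=0.5):
    """post_a, post_b: (frames, classes) posteriors, e.g. from a
    filterbank-feature model and a group-delay-feature model."""
    est_a, est_b = post_a.copy(), post_b.copy()
    for _ in range(iters):
        new_a = post_a * est_b ** w   # stream A re-weighted by B's estimate
        new_b = post_b * est_a ** w   # stream B re-weighted by A's estimate
        est_a = new_a / new_a.sum(axis=1, keepdims=True)
        est_b = new_b / new_b.sum(axis=1, keepdims=True)
    fused = est_a * est_b
    return fused / fused.sum(axis=1, keepdims=True)

a = np.array([[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]])
b = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
print(turbo_fuse(a, b).round(3))
```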
Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639589
M. Mimura, Sei Ueno, H. Inaguma, S. Sakai, Tatsuya Kawahara
{"title":"Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition","authors":"M. Mimura, Sei Ueno, H. Inaguma, S. Sakai, Tatsuya Kawahara","doi":"10.1109/SLT.2018.8639589","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639589","url":null,"abstract":"Encoder-decoder models for acoustic-to-word (A2W) automatic speech recognition (ASR) are attractive for their simplicity of architecture and run-time latency while achieving state-of-the-art performances. However, word-based models commonly suffer from the-of-vocabulary (OOV) word problem. They also cannot leverage text data to improve their language modeling capability. Recently, sequence-to-sequence neural speech synthesis models trainable from corpora have been developed and shown to achieve naturalness com- parable to recorded human speech. In this paper, we explore how we can leverage the current speech synthesis technology to tailor the ASR system for a target domain by preparing only a relevant text corpus. From a set of target domain texts, we generate speech features using a sequence-to-sequence speech synthesizer. These artificial speech features together with real speech features from conventional speech corpora are used to train an attention-based A2W model. Experimental results show that the proposed approach improves the word accuracy significantly compared to the baseline trained only with the real speech, although synthetic part of the training data comes only from a single female speaker voice.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114347005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41
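A hedged sketch of the data-mixing recipe the abstract describes: synthesize acoustic features from target-domain text and pool them with real speech for A2W training. `synthesize_features` below is a hypothetical placeholder for a trained sequence-to-sequence synthesizer:

```python
import random

# Hypothetical stand-in: a real pipeline would run a trained
# sequence-to-sequence synthesizer (Tacotron-style) on target-domain
# text to produce acoustic features.
def synthesize_features(text):
    """Dummy TTS front end: one 40-dim frame per character."""
    return [[(hash((text, t)) % 7) / 7.0] * 40 for t in range(len(text))]

real_corpus = [("hello world", [[0.1] * 40] * 12)]   # (transcript, features)
target_texts = ["set an alarm", "play some jazz"]    # domain text only

synthetic_corpus = [(t, synthesize_features(t)) for t in target_texts]

# Pool real and synthetic utterances into one training set; per the
# abstract, the synthetic side comes from a single female voice.
training_set = real_corpus + synthetic_corpus
random.shuffle(training_set)
for transcript, feats in training_set:
    pass  # feed (feats, transcript) to the attention-based A2W trainer
```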
Investigation of Users’ Short Responses in Actual Conversation System and Automatic Recognition of their Intentions
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639523
Katsuya Yokoyama, Hiroaki Takatsu, Hiroshi Honda, S. Fujie, Tetsunori Kobayashi
{"title":"Investigation of Users’ Short Responses in Actual Conversation System and Automatic Recognition of their Intentions","authors":"Katsuya Yokoyama, Hiroaki Takatsu, Hiroshi Honda, S. Fujie, Tetsunori Kobayashi","doi":"10.1109/SLT.2018.8639523","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639523","url":null,"abstract":"In human-human conversations, listeners often convey intentions to speakers through feedback consisting of reflexive short responses. The speakers recognize these intentions and change the conversational plans to make communication more efficient. These functions are expected to be effective in human-system conversations also; however, there is only a few systems using these functions or a research corpus including such functions. We created a corpus that consists of users’ short responses to an actual conversation system and developed a model for recognizing the intention of these responses. First, we categorized the intention of feedback that affects the progress of conversations. We then collected 15604 short responses of users from 2060 conversation sessions using our news-delivery conversation system. Twelve annotators labeled each utterance based on intention through a listening test. We then designed our deep-neural-network-based intention recognition model using the collected data. We found that feedback in the form of questions, which is the most frequently occurring expression, was correctly recognized and contributed to the efficiency of the conversation system.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114856153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
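As a minimal sketch of the recognition side, a small feed-forward classifier mapping a fixed-size representation of a short response to a feedback-intention label might look as follows; the feature size and the intention inventory here are invented, not the paper's taxonomy:

```python
import torch
import torch.nn as nn

# Minimal stand-in for a DNN intention recognizer: it maps a fixed-size
# embedding of a short response (plus dialogue context) to one of a few
# feedback-intention classes. Sizes and class names are illustrative.
INTENTS = ["acknowledge", "question", "request_repeat", "other"]

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, len(INTENTS)),
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 128)                    # 8 responses, 128-dim features
y = torch.randint(0, len(INTENTS), (8,))   # annotator-derived labels
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
print(INTENTS[model(x).argmax(dim=1)[0]])  # predicted intention, 1st response
```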
Optimizing Neural Response Generator with Emotional Impact Information
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639613
Nurul Lubis, S. Sakti, Koichiro Yoshino, Satoshi Nakamura
{"title":"Optimizing Neural Response Generator with Emotional Impact Information","authors":"Nurul Lubis, S. Sakti, Koichiro Yoshino, Satoshi Nakamura","doi":"10.1109/SLT.2018.8639613","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639613","url":null,"abstract":"The potential of dialogue systems to address user’s emotional need has steadily grown. In particular, we focus on dialogue systems application to promote positive emotional states, similar to that of emotional support between humans. Positive emotion elicitation takes form as chat-based dialogue interactions that is layered with an implicit goal to improve user’s emotional state. To this date, existing approaches have only relied on mimicking the target responses without considering their emotional impact, i.e. the change of emotional state they cause on the listener, in the model itself. In this paper, we propose explicitly utilizing emotional impact information to optimize neural dialogue system towards generating responses that elicit positive emotion. We examine two emotion-rich corpora with different data collection scenarios: Wizard-of-Oz and spontaneous. Evaluation shows that the proposed method yields lower perplexity, as well as produces responses that are perceived as more natural and likely to elicit a more positive emotion.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"232 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121137591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
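One plausible reading of "explicitly utilizing emotional impact information" is to add an impact-based term to the usual generation loss. The sketch below does exactly that with a fixed interpolation weight; the predictor, weight, and combination are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

# Illustrative combination: token cross-entropy plus a penalty when a
# separately trained emotion-impact predictor scores the response as
# unlikely to elicit positive emotion. In a real system the impact term
# needs a differentiable path back to the generator (e.g. scoring
# expected token distributions, or minimum-risk-style training).
def combined_loss(logits, targets, impact_score, weight=0.3):
    """logits: (batch, vocab); targets: (batch,); impact_score in [0, 1]."""
    ce = nn.functional.cross_entropy(logits, targets)
    impact_penalty = (1.0 - impact_score).mean()  # low positivity -> penalty
    return ce + weight * impact_penalty

logits = torch.randn(4, 1000, requires_grad=True)
targets = torch.randint(0, 1000, (4,))
impact = torch.tensor([0.9, 0.4, 0.7, 0.2])  # predicted positivity per response
loss = combined_loss(logits, targets, impact)
loss.backward()
print(float(loss))
```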
Efficient Building Strategy with Knowledge Distillation for Small-Footprint Acoustic Models
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639545
Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Y. Yamaguchi, M. Okamoto, Y. Aono
{"title":"Efficient Building Strategy with Knowledge Distillation for Small-Footprint Acoustic Models","authors":"Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Y. Yamaguchi, M. Okamoto, Y. Aono","doi":"10.1109/SLT.2018.8639545","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639545","url":null,"abstract":"In this paper, we propose a novel training strategy for deep neural network (DNN) based small-footprint acoustic models. The accuracy of DNN-based automatic speech recognition (ASR) systems can be greatly improved by leveraging large amounts of data to improve the level of expression. DNNs use many parameters to enhance recognition performance. Unfortunately, resource-constrained local devices are unable to run complex DNN-based ASR systems. For building compact acoustic models, the knowledge distillation (KD) approach is often used. KD uses a large, well-trained model that outputs target labels to train a compact model. However, the standard KD cannot fully utilize the large model outputs to train compact models because the soft logits provide only rough information. We assume that the large model must give more useful hints to the compact model. We propose an advanced KD that uses mean squared error to minimize the discrepancies between the final hidden layer outputs. We evaluate our proposal on recorded speech data sets assuming car-and home-use scenarios, and show that our models achieve lower character error rates than the conventional KD approach or from-scratch training on computation resource-constrained devices.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124081347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
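The abstract is specific on one point: the student is trained to match the teacher's final hidden layer outputs with a mean-squared-error term, in addition to the usual distillation signal. A minimal PyTorch sketch of that combination follows; the model sizes, the projection bridging the width mismatch, and the loss weights are assumptions:

```python
import torch
import torch.nn as nn

# Student matches (i) hard labels, (ii) the teacher's soft outputs, and
# (iii) the teacher's final hidden layer via MSE, per the abstract.
teacher_hid, student_hid, n_senones = 512, 128, 3000
teacher = nn.Sequential(nn.Linear(40, teacher_hid), nn.ReLU())
teacher_out = nn.Linear(teacher_hid, n_senones)
student = nn.Sequential(nn.Linear(40, student_hid), nn.ReLU())
student_out = nn.Linear(student_hid, n_senones)
proj = nn.Linear(teacher_hid, student_hid)  # hypothetical width bridge

x = torch.randn(16, 40)
y = torch.randint(0, n_senones, (16,))

with torch.no_grad():                 # teacher is fixed
    t_hidden = teacher(x)
    t_logits = teacher_out(t_hidden)

s_hidden = student(x)
s_logits = student_out(s_hidden)

ce = nn.functional.cross_entropy(s_logits, y)
kd = nn.functional.kl_div(
    nn.functional.log_softmax(s_logits, dim=1),
    nn.functional.softmax(t_logits, dim=1),
    reduction="batchmean",
)
hint = nn.functional.mse_loss(s_hidden, proj(t_hidden))  # hidden-layer hint
loss = ce + 0.5 * kd + 0.5 * hint  # illustrative weights
loss.backward()
```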
Rapid Speaker Adaptation of Neural Network Based Filterbank Layer for Automatic Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639648
Hiroshi Seki, Kazumasa Yamamoto, T. Akiba, S. Nakagawa
{"title":"Rapid Speaker Adaptation of Neural Network Based Filterbank Layer for Automatic Speech Recognition","authors":"Hiroshi Seki, Kazumasa Yamamoto, T. Akiba, S. Nakagawa","doi":"10.1109/SLT.2018.8639648","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639648","url":null,"abstract":"Deep neural networks (DNN) have achieved significant success in the field of automatic speech recognition. Previously, we proposed a filterbank-incorporated DNN which takes power spectra as input features. This method has a function of VTLN (Vocal tract length normalization) and fMLLR (feature-space maximum likelihood linear regression). The filterbank layer can be implemented by using a small number of parameters and is optimized under a framework of backpropagation. Therefore, it is advantageous in adaptation under limited available data. In this paper, speaker adaptation is applied to the filterbank-incorporated DNN. By applying speaker adaptation using 15 utterances, the adapted model gave a 7.4% relative improvement in WER over the baseline DNN at a significance level of 0.005 on CSJ task. Adaptation of filterbank layer also showed better performance than the other adaptation methods; singular value decomposition (SVD) based adaptation and learning hidden unit contributions (LHUC).","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123779342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
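A minimal sketch of a filterbank-incorporated DNN and its speaker adaptation: a small trainable filterbank layer sits on top of power spectra, and during adaptation everything but that layer is frozen, so only a few parameters are updated from a handful of utterances. Layer sizes and the softplus/log detail are simplifying assumptions:

```python
import torch
import torch.nn as nn

n_fft_bins, n_filters, n_senones = 257, 40, 2000

class FilterbankDNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Trainable filterbank: a real system would initialize this
        # from mel filters; random init here keeps the sketch short.
        self.fbank = nn.Linear(n_fft_bins, n_filters, bias=False)
        self.backend = nn.Sequential(
            nn.Linear(n_filters, 256), nn.ReLU(),
            nn.Linear(256, n_senones),
        )

    def forward(self, power_spec):
        # softplus keeps filter outputs positive before the log
        fb = torch.log(nn.functional.softplus(self.fbank(power_spec)) + 1e-6)
        return self.backend(fb)

model = FilterbankDNN()
# Speaker adaptation: freeze everything except the filterbank layer.
for p in model.backend.parameters():
    p.requires_grad = False
opt = torch.optim.SGD(model.fbank.parameters(), lr=1e-3)

spec = torch.rand(30, n_fft_bins)            # 30 frames of one utterance
labels = torch.randint(0, n_senones, (30,))  # frame-level senone targets
opt.zero_grad()
nn.functional.cross_entropy(model(spec), labels).backward()
opt.step()
```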
Exploring Layer Trajectory LSTM with Depth Processing Units and Attention
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639637
Jinyu Li, Liang Lu, Changliang Liu, Y. Gong
{"title":"Exploring Layer Trajectory LSTM with Depth Processing Units and Attention","authors":"Jinyu Li, Liang Lu, Changliang Liu, Y. Gong","doi":"10.1109/SLT.2018.8639637","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639637","url":null,"abstract":"Traditional LSTM model and its variants normally work in a frame-by-frame and layer-by-layer fashion, which deals with the temporal modeling and target classification problems at the same time. In this paper, we extend our recently proposed layer trajectory LSTM (ltLSTM) and present a generalized framework, which is equipped with a depth processing block that scans the hidden states of each time-LSTM layer, and uses the summarized layer trajectory information for final senone classification. We explore different modeling units used in the depth processing block to have a good tradeoff between accuracy and runtime cost. Furthermore, we integrate an attention module into this framework to explore wide context information, which is especially beneficial for uni-directional LSTMs. Trained with 30 thousand hours of EN-US Microsoft internal data and cross entropy criterion, the proposed generalized ltLSTM performed significantly better than the standard multi-layer time-LSTM, with up to 12.8% relative word error rate (WER) reduction across different tasks. With attention modeling, the relative WER reduction can be up to 17.9%. We observed similar gain when the models were trained with sequence discriminative training criterion.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123140661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
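A toy rendering of the layer-trajectory idea: standard time-LSTMs handle temporal modeling layer by layer, while a depth processing unit (here simply another LSTM) scans the per-frame hidden states across layers and feeds its summary to the senone classifier. All dimensions are illustrative:

```python
import torch
import torch.nn as nn

T, feat, hid, layers, senones = 50, 40, 64, 4, 1000

# Time-LSTMs: one per layer, operating along the time axis.
time_lstms = nn.ModuleList(
    [nn.LSTM(feat if i == 0 else hid, hid, batch_first=True)
     for i in range(layers)]
)
depth_lstm = nn.LSTM(hid, hid, batch_first=True)  # scans the layer axis
classifier = nn.Linear(hid, senones)

x = torch.randn(1, T, feat)
layer_states = []
h = x
for lstm in time_lstms:
    h, _ = lstm(h)            # (1, T, hid) hidden states of this layer
    layer_states.append(h)

# Stack to (T, layers, hid): for every frame, a "trajectory" over depth.
traj = torch.stack(layer_states, dim=2).squeeze(0)  # (T, layers, hid)
summary, _ = depth_lstm(traj)                       # scan across layers
logits = classifier(summary[:, -1, :])              # last depth step per frame
print(logits.shape)  # torch.Size([50, 1000])
```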
Improving ASR Error Detection with RNNLM Adaptation
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639602
Rahhal Errattahi, S. Deena, A. Hannani, H. Ouahmane, Thomas Hain
{"title":"Improving ASR Error Detection with RNNLM Adaptation","authors":"Rahhal Errattahi, S. Deena, A. Hannani, H. Ouahmane, Thomas Hain","doi":"10.1109/SLT.2018.8639602","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639602","url":null,"abstract":"Applications of automatic speech recognition (ASR) such as broadcast transcription and dialog systems, can be helped by the ability to detect errors in the ASR output. The field of ASR error detection has emerged as a way to detect and subsequently correct ASR errors. The most common approach for ASR error detection is features-based, where a set of features are extracted from the ASR output and used to train a classifier to predict correct/incorrect labels.Language models (LMs), either from the ASR decoder or externally trained, can be used to provide features to an ASR error detection system, through scores computed on the ASR output. Recently, recurrent neural network language models (RNNLMs) features were proposed for ASR error detection with improvements to the classification rate, thanks to their ability to model longer-range context.RNNLM adaptation, through the introduction of auxiliary features that encode domain, has been shown to improve ASR performance. This work investigates whether RNNLM adaptation techniques can also improve ASR error detection performance in the context of multi-genre broadcast ASR. The results show that an overall improvement of about 1% in the F-measure can be achieved using adapted RNNLM features.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124661055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
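A compact sketch of the feature-based detection pipeline: per hypothesized word, collect decoder confidence together with scores from a background and an adapted RNNLM, then train a correct/incorrect classifier. The feature values below are fabricated placeholders for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-word features for ASR error detection; in the paper's setting the
# third column would come from a genre-adapted RNNLM.
# columns: [asr_confidence, rnnlm_logprob, adapted_rnnlm_logprob]
X = np.array([
    [0.95, -2.1, -1.8],
    [0.40, -6.3, -5.9],
    [0.88, -3.0, -2.4],
    [0.35, -7.2, -7.5],
])
y = np.array([1, 0, 1, 0])  # 1 = word correct, 0 = ASR error

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.5, -5.0, -3.2]]))  # adapted-LM score can flip the call
```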
Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639587
Jia Cui, Chao Weng, Guangsen Wang, J. Wang, Peidong Wang, Chengzhu Yu, Dan Su, Dong Yu
{"title":"Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions","authors":"Jia Cui, Chao Weng, Guangsen Wang, J. Wang, Peidong Wang, Chengzhu Yu, Dan Su, Dong Yu","doi":"10.1109/SLT.2018.8639587","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639587","url":null,"abstract":"Acoustic model and language model (LM) have been two major components in conventional speech recognition systems. They are normally trained independently, but recently there has been a trend to optimize both components simultaneously in a unified end-to-end (E2E) framework. However, the performance gap between the E2E systems and the traditional hybrid systems suggests that some knowledge has not yet been fully utilized in the new framework. An observation is that the current attention-based E2E systems could produce better recognition results when decoded with LMs which are independently trained with the same resource.In this paper, we focus on how to improve attention-based E2E systems without increasing model complexity or resorting to extra data. A novel training strategy is proposed for multi-task training with the connectionist temporal classification (CTC) loss. The sequence-based minimum Bayes risk (MBR) loss is also investigated. Our experiments on SWB 300hrs showed that both loss functions could significantly improve the baseline model performance. The additional gain from joint-LM decoding remains the same for CTC trained model but is only marginal for MBR trained model. This implies that while CTC loss function is able to capture more acoustic knowledge, MBR loss function exploits more word/character dependency.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127882485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
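A minimal sketch of the multi-task setup: a CTC auxiliary loss on the encoder interpolated with the attention decoder's cross-entropy. The tensors stand in for encoder and decoder outputs, and the interpolation weight is illustrative:

```python
import torch
import torch.nn as nn

T, B, vocab, dec_len = 100, 4, 500, 12
lam = 0.3  # CTC weight; the exact value is an assumption

ctc_loss = nn.CTCLoss(blank=0)
# Encoder outputs as (time, batch, vocab) log-probabilities.
enc_log_probs = torch.randn(T, B, vocab).log_softmax(dim=2)
targets = torch.randint(1, vocab, (B, dec_len))  # labels avoid blank=0
input_lens = torch.full((B,), T, dtype=torch.long)
target_lens = torch.full((B,), dec_len, dtype=torch.long)

# Attention decoder outputs as (batch, dec_len, vocab) logits.
dec_logits = torch.randn(B, dec_len, vocab, requires_grad=True)
att_ce = nn.functional.cross_entropy(
    dec_logits.reshape(-1, vocab), targets.reshape(-1)
)

loss = lam * ctc_loss(enc_log_probs, targets, input_lens, target_lens) \
       + (1 - lam) * att_ce
print(float(loss))
```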