2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Sentiment Classification on Erroneous ASR Transcripts: A Multi View Learning Approach
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639665
Sri Harsha Dumpala, I. Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu
{"title":"Sentiment Classification on Erroneous ASR Transcripts: A Multi View Learning Approach","authors":"Sri Harsha Dumpala, I. Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu","doi":"10.1109/SLT.2018.8639665","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639665","url":null,"abstract":"Sentiment classification on spoken language transcriptions has received less attention. A practical system employing the spoken language modality will have to use a language transcription from an Automatic Speech Recognition (ASR) engine which is inherently prone to errors. The main interest of this paper lies in improvement of sentiment classification on erroneous ASR transcriptions. Our aim is to improve the representation of the ASR transcripts using the manual transcripts and other modalities, like audio and visual, that are available during training but not necessarily during test conditions. We adopt an approach based on Deep Canonical Correlation Analysis (DCCA) and propose two new extensions of DCCA to enhance the ASR view using multiple modalities. We present a detailed evaluation of the performance of our approach on datasets of opinion videos (CMU-MOSI and CMU-MOSEI) collected from Youtube.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128379102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
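As context for the DCCA objective named in the abstract, here is a minimal numpy sketch of the quantity DCCA maximizes: the sum of canonical correlations between two views. In the paper's setting the views would be learned representations of the ASR transcript and the manual transcript (or another modality); the array shapes and toy data below are purely illustrative.

```python
import numpy as np

def cca_correlation(H1, H2, eps=1e-8):
    """Sum of canonical correlations between two views (samples x features).

    This is the quantity DCCA maximizes; in DCCA, H1 and H2 would be the
    outputs of two neural networks rather than raw features.
    """
    n = H1.shape[0]
    H1 = H1 - H1.mean(axis=0)          # center each view
    H2 = H2 - H2.mean(axis=0)
    S11 = H1.T @ H1 / (n - 1) + eps * np.eye(H1.shape[1])  # regularized covariances
    S22 = H2.T @ H2 / (n - 1) + eps * np.eye(H2.shape[1])
    S12 = H1.T @ H2 / (n - 1)
    # Whiten each view; the singular values of T are the canonical correlations.
    K11 = np.linalg.inv(np.linalg.cholesky(S11))
    K22 = np.linalg.inv(np.linalg.cholesky(S22))
    T = K11 @ S12 @ K22.T
    return np.linalg.svd(T, compute_uv=False).sum()

# Toy usage: 100 utterances, 8-dim ASR-view and manual-view embeddings.
rng = np.random.default_rng(0)
asr_view = rng.standard_normal((100, 8))
manual_view = asr_view @ rng.standard_normal((8, 8)) + 0.1 * rng.standard_normal((100, 8))
print(cca_correlation(asr_view, manual_view))  # close to 8 for strongly correlated views
```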
[Title page]
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/slt.2018.8639512
{"title":"[Title page]","authors":"","doi":"10.1109/slt.2018.8639512","DOIUrl":"https://doi.org/10.1109/slt.2018.8639512","url":null,"abstract":"","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129886873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639699
Jesse Emond, B. Ramabhadran, Brian Roark, P. Moreno, Min Ma
{"title":"Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance","authors":"Jesse Emond, B. Ramabhadran, Brian Roark, P. Moreno, Min Ma","doi":"10.1109/SLT.2018.8639699","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639699","url":null,"abstract":"Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER) that smoothes out many of these irregularities by mapping all text to one writing system and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"286 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130825756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
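The core of toWER, as described above, is scoring hypothesis and reference in a single writing system. The sketch below assumes a `transliterate` mapping (here a toy dict; the paper uses trained transliteration models) on top of a standard Levenshtein WER.

```python
def wer(ref_words, hyp_words):
    """Standard word error rate via Levenshtein distance over words."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref_words), 1)

def to_wer(ref, hyp, transliterate):
    """toWER-style scoring: normalize both strings into one writing system first.

    `transliterate` is a stand-in for a real transliteration model; here it is
    a simple word-level lookup, purely for illustration.
    """
    norm = lambda s: [transliterate.get(w, w) for w in s.split()]
    return wer(norm(ref), norm(hyp))

# Toy example: "namaste" rendered in Latin vs. Devanagari counts as a match.
mapping = {"namaste": "नमस्ते"}
print(to_wer("नमस्ते world", "namaste world", mapping))  # 0.0 after normalization
```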
A New Timit Benchmark for Context-Independent Phone Recognition Using Turbo Fusion
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639670
Timo Lohrenz, Wei Li, T. Fingscheidt
{"title":"A New Timit Benchmark for Context-Independent Phone Recognition Using Turbo Fusion","authors":"Timo Lohrenz, Wei Li, T. Fingscheidt","doi":"10.1109/SLT.2018.8639670","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639670","url":null,"abstract":"In this work, we apply the recently proposed turbo fusion in conjunction with state-of-the-art convolutional neural networks as acoustic models to the standard phone recognition task on the TIMIT database. The turbo fusion operates on posterior streams stemming from standard filterbank features and from group delay (phase) features. By the iterative exchange of posterior information, the phone error rate is decreased down to 16.91% absolute, which is to our knowledge the best reported result on the TIMIT core test set so far using context-independent acoustic models, outperforming the previous respective benchmark by 4.4% relative.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130740861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
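The abstract describes an iterative exchange of posterior information between a filterbank stream and a phase stream. The following is a loose numpy schematic of that idea, not the paper's exact turbo decoder; the exchange weight and iteration count are arbitrary choices.

```python
import numpy as np

def turbo_fuse(p_fbank, p_phase, iterations=3, weight=0.5):
    """Schematic iterative posterior exchange between two feature streams.

    p_fbank, p_phase: (frames x classes) posteriors from two acoustic models.
    Each pass feeds one stream's tempered posteriors into the other as
    extrinsic information, loosely mirroring turbo decoding.
    """
    a, b = p_fbank.copy(), p_phase.copy()
    for _ in range(iterations):
        a_new = a * b ** weight      # refine stream A with extrinsic info from B
        b_new = b * a ** weight      # and vice versa, using A's pre-update state
        a = a_new / a_new.sum(axis=1, keepdims=True)  # renormalize to posteriors
        b = b_new / b_new.sum(axis=1, keepdims=True)
    fused = a * b
    return fused / fused.sum(axis=1, keepdims=True)

# Toy usage: 4 frames, 3 phone classes.
rng = np.random.default_rng(1)
p1 = rng.dirichlet(np.ones(3), size=4)
p2 = rng.dirichlet(np.ones(3), size=4)
print(turbo_fuse(p1, p2).round(3))
```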
Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639589
M. Mimura, Sei Ueno, H. Inaguma, S. Sakai, Tatsuya Kawahara
{"title":"Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition","authors":"M. Mimura, Sei Ueno, H. Inaguma, S. Sakai, Tatsuya Kawahara","doi":"10.1109/SLT.2018.8639589","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639589","url":null,"abstract":"Encoder-decoder models for acoustic-to-word (A2W) automatic speech recognition (ASR) are attractive for their simplicity of architecture and run-time latency while achieving state-of-the-art performances. However, word-based models commonly suffer from the-of-vocabulary (OOV) word problem. They also cannot leverage text data to improve their language modeling capability. Recently, sequence-to-sequence neural speech synthesis models trainable from corpora have been developed and shown to achieve naturalness com- parable to recorded human speech. In this paper, we explore how we can leverage the current speech synthesis technology to tailor the ASR system for a target domain by preparing only a relevant text corpus. From a set of target domain texts, we generate speech features using a sequence-to-sequence speech synthesizer. These artificial speech features together with real speech features from conventional speech corpora are used to train an attention-based A2W model. Experimental results show that the proposed approach improves the word accuracy significantly compared to the baseline trained only with the real speech, although synthetic part of the training data comes only from a single female speaker voice.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114347005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41
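The data-pooling step described above, combining real corpus features with features synthesized from target-domain text, might look as follows. The `synthesize_features` stub stands in for a real sequence-to-sequence synthesizer, and all names and shapes are illustrative.

```python
import numpy as np
import torch
from torch.utils.data import ConcatDataset, Dataset

class FeatureDataset(Dataset):
    """(speech features, word sequence) pairs for an attention-based A2W model."""
    def __init__(self, feats, targets):
        self.feats, self.targets = feats, targets
    def __len__(self):
        return len(self.feats)
    def __getitem__(self, i):
        return torch.as_tensor(self.feats[i]), self.targets[i]

def synthesize_features(text, n_frames=50, dim=80):
    """Stand-in for a sequence-to-sequence synthesizer (e.g., Tacotron-style).

    A real system would predict mel features from `text`; random frames here
    only keep the sketch self-contained.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal((n_frames, dim)).astype(np.float32)

# Real speech features from a conventional corpus (placeholder arrays) ...
real = FeatureDataset([np.zeros((40, 80), np.float32)], [["hello", "world"]])
# ... pooled with features synthesized from target-domain text only.
domain_texts = [["open", "the", "pod", "bay", "doors"]]
synth = FeatureDataset(
    [synthesize_features(" ".join(t)) for t in domain_texts], domain_texts)
train_set = ConcatDataset([real, synth])  # the A2W model trains on the union
print(len(train_set))                     # 2
```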
Investigation of Users’ Short Responses in Actual Conversation System and Automatic Recognition of their Intentions
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639523
Katsuya Yokoyama, Hiroaki Takatsu, Hiroshi Honda, S. Fujie, Tetsunori Kobayashi
{"title":"Investigation of Users’ Short Responses in Actual Conversation System and Automatic Recognition of their Intentions","authors":"Katsuya Yokoyama, Hiroaki Takatsu, Hiroshi Honda, S. Fujie, Tetsunori Kobayashi","doi":"10.1109/SLT.2018.8639523","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639523","url":null,"abstract":"In human-human conversations, listeners often convey intentions to speakers through feedback consisting of reflexive short responses. The speakers recognize these intentions and change the conversational plans to make communication more efficient. These functions are expected to be effective in human-system conversations also; however, there is only a few systems using these functions or a research corpus including such functions. We created a corpus that consists of users’ short responses to an actual conversation system and developed a model for recognizing the intention of these responses. First, we categorized the intention of feedback that affects the progress of conversations. We then collected 15604 short responses of users from 2060 conversation sessions using our news-delivery conversation system. Twelve annotators labeled each utterance based on intention through a listening test. We then designed our deep-neural-network-based intention recognition model using the collected data. We found that feedback in the form of questions, which is the most frequently occurring expression, was correctly recognized and contributed to the efficiency of the conversation system.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114856153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
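For a concrete picture of the recognition component, here is a minimal PyTorch sketch of a DNN intention classifier over fixed-size short-response features. The label set, feature dimension, and network sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical label set; the paper's categories include question-type
# feedback, which it found most frequent and well recognized.
INTENTS = ["acknowledge", "question", "request_repeat", "negative"]

class IntentionClassifier(nn.Module):
    """Small feed-forward net mapping a fixed-size utterance embedding
    (e.g., pooled acoustic + lexical features) to intention posteriors."""
    def __init__(self, feat_dim=128, hidden=64, n_classes=len(INTENTS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )
    def forward(self, x):
        return self.net(x)

model = IntentionClassifier()
x = torch.randn(8, 128)             # batch of 8 short-response embeddings
pred = model(x).argmax(dim=1)
print([INTENTS[i] for i in pred.tolist()])
```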
Optimizing Neural Response Generator with Emotional Impact Information
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639613
Nurul Lubis, S. Sakti, Koichiro Yoshino, Satoshi Nakamura
{"title":"Optimizing Neural Response Generator with Emotional Impact Information","authors":"Nurul Lubis, S. Sakti, Koichiro Yoshino, Satoshi Nakamura","doi":"10.1109/SLT.2018.8639613","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639613","url":null,"abstract":"The potential of dialogue systems to address user’s emotional need has steadily grown. In particular, we focus on dialogue systems application to promote positive emotional states, similar to that of emotional support between humans. Positive emotion elicitation takes form as chat-based dialogue interactions that is layered with an implicit goal to improve user’s emotional state. To this date, existing approaches have only relied on mimicking the target responses without considering their emotional impact, i.e. the change of emotional state they cause on the listener, in the model itself. In this paper, we propose explicitly utilizing emotional impact information to optimize neural dialogue system towards generating responses that elicit positive emotion. We examine two emotion-rich corpora with different data collection scenarios: Wizard-of-Oz and spontaneous. Evaluation shows that the proposed method yields lower perplexity, as well as produces responses that are perceived as more natural and likely to elicit a more positive emotion.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"232 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121137591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
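One simple way to fold emotional-impact annotations into generator training, in the spirit of the abstract above, is to reweight the per-response likelihood loss by impact. This is a hedged sketch under that assumption, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def emotion_weighted_nll(logits, targets, impact, alpha=0.5, pad_id=0):
    """Token NLL reweighted by per-response emotional impact.

    logits: (batch, seq, vocab); targets: (batch, seq);
    impact: (batch,) score in [0, 1] for how positively each target response
    shifted the listener's emotional state (annotated in the corpus).
    Higher-impact responses get a larger training weight; `alpha` controls
    the strength of that bias and is an illustrative choice.
    """
    nll = F.cross_entropy(logits.transpose(1, 2), targets,
                          ignore_index=pad_id, reduction="none")  # (batch, seq)
    per_resp = nll.sum(dim=1) / (targets != pad_id).sum(dim=1).clamp(min=1)
    weights = 1.0 + alpha * impact      # impact-boosted sample weights
    return (weights * per_resp).mean()

# Toy usage: batch of 2 responses, vocab of 10, one high-impact sample.
logits = torch.randn(2, 5, 10)
targets = torch.randint(1, 10, (2, 5))
impact = torch.tensor([0.9, 0.1])
print(emotion_weighted_nll(logits, targets, impact))
```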
Efficient Building Strategy with Knowledge Distillation for Small-Footprint Acoustic Models
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639545
Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Y. Yamaguchi, M. Okamoto, Y. Aono
{"title":"Efficient Building Strategy with Knowledge Distillation for Small-Footprint Acoustic Models","authors":"Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Y. Yamaguchi, M. Okamoto, Y. Aono","doi":"10.1109/SLT.2018.8639545","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639545","url":null,"abstract":"In this paper, we propose a novel training strategy for deep neural network (DNN) based small-footprint acoustic models. The accuracy of DNN-based automatic speech recognition (ASR) systems can be greatly improved by leveraging large amounts of data to improve the level of expression. DNNs use many parameters to enhance recognition performance. Unfortunately, resource-constrained local devices are unable to run complex DNN-based ASR systems. For building compact acoustic models, the knowledge distillation (KD) approach is often used. KD uses a large, well-trained model that outputs target labels to train a compact model. However, the standard KD cannot fully utilize the large model outputs to train compact models because the soft logits provide only rough information. We assume that the large model must give more useful hints to the compact model. We propose an advanced KD that uses mean squared error to minimize the discrepancies between the final hidden layer outputs. We evaluate our proposal on recorded speech data sets assuming car-and home-use scenarios, and show that our models achieve lower character error rates than the conventional KD approach or from-scratch training on computation resource-constrained devices.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124081347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
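The proposed loss combines the usual soft-label distillation term with an MSE hint between final-hidden-layer outputs. A PyTorch sketch follows; the linear projection bridging student and teacher widths, and all weights and sizes, are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits,
            student_hidden, teacher_hidden, proj,
            temperature=2.0, lam=0.5):
    """Knowledge distillation with an extra hidden-representation hint.

    Standard KD matches tempered teacher posteriors (the KL term); the paper's
    extension additionally minimizes the MSE between final-hidden-layer
    outputs. `proj` is a linear map for when the compact model is narrower
    than the teacher (an assumption of this sketch).
    """
    t = temperature
    soft = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * t * t
    hint = F.mse_loss(proj(student_hidden), teacher_hidden)
    return soft + lam * hint

# Toy usage: teacher width 512, student width 128, 40 output classes.
proj = torch.nn.Linear(128, 512)
s_logits, t_logits = torch.randn(16, 40), torch.randn(16, 40)
s_hid, t_hid = torch.randn(16, 128), torch.randn(16, 512)
print(kd_loss(s_logits, t_logits, s_hid, t_hid, proj))
```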
Rapid Speaker Adaptation of Neural Network Based Filterbank Layer for Automatic Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639648
Hiroshi Seki, Kazumasa Yamamoto, T. Akiba, S. Nakagawa
{"title":"Rapid Speaker Adaptation of Neural Network Based Filterbank Layer for Automatic Speech Recognition","authors":"Hiroshi Seki, Kazumasa Yamamoto, T. Akiba, S. Nakagawa","doi":"10.1109/SLT.2018.8639648","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639648","url":null,"abstract":"Deep neural networks (DNN) have achieved significant success in the field of automatic speech recognition. Previously, we proposed a filterbank-incorporated DNN which takes power spectra as input features. This method has a function of VTLN (Vocal tract length normalization) and fMLLR (feature-space maximum likelihood linear regression). The filterbank layer can be implemented by using a small number of parameters and is optimized under a framework of backpropagation. Therefore, it is advantageous in adaptation under limited available data. In this paper, speaker adaptation is applied to the filterbank-incorporated DNN. By applying speaker adaptation using 15 utterances, the adapted model gave a 7.4% relative improvement in WER over the baseline DNN at a significance level of 0.005 on CSJ task. Adaptation of filterbank layer also showed better performance than the other adaptation methods; singular value decomposition (SVD) based adaptation and learning hidden unit contributions (LHUC).","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123779342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
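The filterbank-incorporated DNN can be pictured as a learnable linear layer on power spectra in front of a conventional backend, with adaptation updating only that layer. A sketch with illustrative sizes and random initialization (a real system would initialize from mel filterbank weights):

```python
import torch
import torch.nn as nn

class FilterbankDNN(nn.Module):
    """Acoustic model whose first layer is a learnable filterbank applied to
    power spectra; only this layer is updated during speaker adaptation."""
    def __init__(self, n_fft_bins=257, n_filters=40, n_senones=3000):
        super().__init__()
        self.fbank = nn.Linear(n_fft_bins, n_filters, bias=False)  # filterbank weights
        self.backend = nn.Sequential(
            nn.Linear(n_filters, 512), nn.ReLU(),
            nn.Linear(512, n_senones),
        )
    def forward(self, power_spec):
        # Log filterbank energies, clamped to keep the log well defined.
        feats = torch.log(torch.relu(self.fbank(power_spec)) + 1e-6)
        return self.backend(feats)

model = FilterbankDNN()
# Speaker adaptation: freeze everything except the filterbank layer, so the
# small set of filterbank parameters can be re-estimated from ~15 utterances.
for p in model.backend.parameters():
    p.requires_grad = False
optim = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # 257*40 = 10280
```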
Exploring Layer Trajectory LSTM with Depth Processing Units and Attention
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2018-12-01 DOI: 10.1109/SLT.2018.8639637
Jinyu Li, Liang Lu, Changliang Liu, Y. Gong
{"title":"Exploring Layer Trajectory LSTM with Depth Processing Units and Attention","authors":"Jinyu Li, Liang Lu, Changliang Liu, Y. Gong","doi":"10.1109/SLT.2018.8639637","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639637","url":null,"abstract":"Traditional LSTM model and its variants normally work in a frame-by-frame and layer-by-layer fashion, which deals with the temporal modeling and target classification problems at the same time. In this paper, we extend our recently proposed layer trajectory LSTM (ltLSTM) and present a generalized framework, which is equipped with a depth processing block that scans the hidden states of each time-LSTM layer, and uses the summarized layer trajectory information for final senone classification. We explore different modeling units used in the depth processing block to have a good tradeoff between accuracy and runtime cost. Furthermore, we integrate an attention module into this framework to explore wide context information, which is especially beneficial for uni-directional LSTMs. Trained with 30 thousand hours of EN-US Microsoft internal data and cross entropy criterion, the proposed generalized ltLSTM performed significantly better than the standard multi-layer time-LSTM, with up to 12.8% relative word error rate (WER) reduction across different tasks. With attention modeling, the relative WER reduction can be up to 17.9%. We observed similar gain when the models were trained with sequence discriminative training criterion.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123140661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
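A simplified rendering of the ltLSTM idea described above: time-LSTMs run frame by frame, while a depth-LSTM scans the stack of per-layer hidden states at each frame for senone classification. Dimensions are illustrative, and the attention module is omitted.

```python
import torch
import torch.nn as nn

class LayerTrajectoryLSTM(nn.Module):
    """Simplified ltLSTM: time-LSTMs handle temporal modeling, while a
    depth-LSTM scans the per-layer hidden states at each frame and its final
    state feeds senone classification."""
    def __init__(self, feat_dim=80, hidden=256, n_layers=4, n_senones=3000):
        super().__init__()
        self.time_lstms = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
             for i in range(n_layers)])
        self.depth_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_senones)

    def forward(self, x):                     # x: (batch, frames, feat_dim)
        layer_states = []
        for lstm in self.time_lstms:
            x, _ = lstm(x)                    # temporal modeling per layer
            layer_states.append(x)
        # Stack to (batch*frames, n_layers, hidden): the depth-LSTM scans
        # across layers at every frame, decoupling classification from time.
        traj = torch.stack(layer_states, dim=2)
        b, t, l, h = traj.shape
        _, (h_n, _) = self.depth_lstm(traj.reshape(b * t, l, h))
        return self.out(h_n[-1].reshape(b, t, h))

model = LayerTrajectoryLSTM()
print(model(torch.randn(2, 10, 80)).shape)   # torch.Size([2, 10, 3000])
```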