2016 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Recognizing emotions in spoken dialogue with hierarchically fused acoustic and lexical features
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846319
Leimin Tian, Johanna D. Moore, Catherine Lai
Abstract: Automatic emotion recognition is vital for building natural and engaging human-computer interaction systems. Combining information from multiple modalities typically improves emotion recognition performance. In previous work, features from different modalities have generally been fused at the same level with two types of fusion strategies: Feature-Level fusion, which concatenates feature sets before recognition; and Decision-Level fusion, which makes the final decision based on outputs of the unimodal models. However, different features may describe data at different time scales or have different levels of abstraction. Cognitive Science research also indicates that when perceiving emotions, humans use information from different modalities at different cognitive levels and time steps. Therefore, we propose a Hierarchical fusion strategy for multimodal emotion recognition, which incorporates global or more abstract features at higher levels of its knowledge-inspired structure. We build multimodal emotion recognition models combining state-of-the-art acoustic and lexical features to study the performance of the proposed Hierarchical fusion. Experiments on two emotion databases of spoken dialogue show that this fusion strategy consistently outperforms both Feature-Level and Decision-Level fusion. The multimodal emotion recognition models using the Hierarchical fusion strategy achieved state-of-the-art performance on recognizing emotions in both spontaneous and acted dialogue.
Citations: 45
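The paper's exact architecture is not reproduced here; as a rough illustration of the hierarchical idea, the following Python sketch (hypothetical feature dimensions and layer sizes) encodes the frame-level acoustic sequence at a lower layer and injects the more abstract utterance-level lexical vector at a higher layer, rather than concatenating raw feature sets up front.

```python
# Minimal sketch of hierarchical fusion; dimensions are assumptions, not the paper's.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, acoustic_dim=88, lexical_dim=300, hidden=64, n_classes=4):
        super().__init__()
        # lower level: encode the frame-level acoustic sequence
        self.acoustic_enc = nn.LSTM(acoustic_dim, hidden, batch_first=True)
        # higher level: inject the more abstract utterance-level lexical
        # vector here, instead of concatenating raw feature sets up front
        self.fusion = nn.Linear(hidden + lexical_dim, hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, acoustic_seq, lexical_vec):
        _, (h, _) = self.acoustic_enc(acoustic_seq)   # h: (1, batch, hidden)
        fused = torch.tanh(self.fusion(torch.cat([h[-1], lexical_vec], dim=-1)))
        return self.out(fused)

model = HierarchicalFusion()
logits = model(torch.randn(2, 100, 88), torch.randn(2, 300))  # 2 utterances, 100 frames
print(logits.shape)  # torch.Size([2, 4])
```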
Improved prediction of the accent gap between speakers of English for individual-based clustering of World Englishes
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846255
Fumiya Shiozawa, D. Saito, N. Minematsu
Abstract: The term "World Englishes" describes the current state of English, one of whose main characteristics is a large diversity of pronunciations, called accents. In our previous studies, we developed several techniques for effective clustering and visualization of this diversity. To this end, the accent gap between two speakers has to be quantified independently of extra-linguistic factors such as age and gender. To achieve this, a unique representation of speech, called speech structure, which is theoretically invariant to these factors, was applied to represent pronunciation. In the current study, we attempt to improve accent gap prediction by controlling the degree of invariance. Two techniques are tested: DNN-based model-free estimation of divergence and multi-stream speech structures. In the former, instead of estimating the separability between two speech events based on model assumptions, DNN-based class posteriors are used for estimation. In the latter, constrained invariance is realized by deriving one speech structure for each sub-space of acoustic features. Our proposals are evaluated in terms of the correlation between reference accent gaps and the predicted gaps. Experiments show that the correlation improves from 0.718 to 0.730.
Citations: 1
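To illustrate the model-free estimation idea, here is a toy Python sketch (synthetic 2-D features, not the paper's setup): a small classifier is trained to discriminate two speech events, and its held-out posteriors yield a Bayes-error-style separability estimate with no Gaussian assumptions.

```python
# Toy model-free separability estimate from classifier posteriors.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
event_a = rng.normal(0.0, 1.0, size=(500, 2))   # synthetic frames of event A
event_b = rng.normal(1.5, 1.0, size=(500, 2))   # synthetic frames of event B
X = np.vstack([event_a, event_b])
y = np.array([0] * 500 + [1] * 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
post = clf.predict_proba(X_te)
# Bayes-error-style separability from posteriors: 1 - E[min(p0, p1)]
separability = 1.0 - np.minimum(post[:, 0], post[:, 1]).mean()
print(f"estimated separability: {separability:.3f}")
```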
A log-linear weighting approach in the Word2vec space for spoken language understanding
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846289
Killian Janod, Mohamed Morchid, Richard Dufour, G. Linarès
Abstract: This paper proposes an original method that integrates the contextual information of words into Word2vec neural networks, which learn from words and their respective context windows. In the classical word-embedding approach, context windows are represented as bags-of-words, i.e., every word in the context is treated equally. We propose a log-linear weighting approach that models the continuous context, taking into account the relative position of words in the surrounding context of the target word. The quality improvements yielded by this method are shown on the Semantic-Syntactic Word Relationship test and in a real application framework involving a theme identification task on human dialogues. The promising gains of 7 and 5 points for the Skip-gram and CBOW variants of our adapted Word2vec model, respectively, demonstrate that the proposed models are a step forward for word and document representation.
Citations: 3
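The paper defines its own weighting function; the sketch below assumes a simple log-linear decay in word distance (with a hypothetical hyperparameter alpha) to show how positional weights replace the uniform bag-of-words average over context embeddings.

```python
# Sketch: position-weighted context vector instead of a uniform bag-of-words.
import numpy as np

def weighted_context(embeddings, context_ids, positions, alpha=0.5):
    """positions: signed offsets from the target word, e.g. [-2, -1, 1, 2]."""
    # assumed weighting form: w = (1 + |pos|)^(-alpha), log-linear in log-distance
    w = np.exp(-alpha * np.log1p(np.abs(positions)))
    w = w / w.sum()                                     # normalize weights
    return (w[:, None] * embeddings[context_ids]).sum(axis=0)

vocab = np.random.default_rng(1).normal(size=(1000, 50))  # toy embedding table
ctx = weighted_context(vocab, np.array([3, 17, 42, 7]), np.array([-2, -1, 1, 2]))
print(ctx.shape)  # (50,)
```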
Improving multi-stream classification by mapping sequence-embedding in a high dimensional space
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846269
Mohamed Bouaziz, Mohamed Morchid, Richard Dufour, G. Linarès
Abstract: Most Natural and Spoken Language Processing tasks now employ Neural Networks (NNs), allowing them to reach impressive performance. Embedding features allow NLP systems to represent input vectors in a latent space and improve the observed performance. In this context, Recurrent Neural Network (RNN) architectures such as Long Short-Term Memory (LSTM) are well known for their capacity to encode sequential data into a non-sequential hidden vector representation, called a sequence embedding. In this paper, we propose an LSTM-based multi-stream sequence embedding that encodes parallel sequences into a single non-sequential latent representation vector. We then map this embedding into a high-dimensional space using a Support Vector Machine (SVM) in order to classify the multi-stream sequences by finding an optimal separating hyperplane. The multi-stream sequence embedding allows the SVM classifier to profit more efficiently from the information carried by both parallel streams and longer sequences. The system achieved the best performance in a multi-stream sequence classification task, with a 9-point gain in error rate compared to an SVM trained on the original input sequences.
Citations: 2
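A minimal sketch of the pipeline (hypothetical dimensions, random data): parallel streams are concatenated per time step, an LSTM produces a single sequence embedding, and an RBF-kernel SVM performs the implicit high-dimensional mapping and classification.

```python
# Sketch: LSTM multi-stream sequence embedding fed to an RBF-kernel SVM.
import torch
import torch.nn as nn
from sklearn.svm import SVC

torch.manual_seed(0)
lstm = nn.LSTM(input_size=2 * 20, hidden_size=32, batch_first=True)

def embed(stream_a, stream_b):
    # stream_a, stream_b: (batch, time, 20) parallel sequences
    x = torch.cat([stream_a, stream_b], dim=-1)   # fuse streams per time step
    _, (h, _) = lstm(x)
    return h[-1].detach().numpy()                 # non-sequential embedding

X = embed(torch.randn(40, 50, 20), torch.randn(40, 50, 20))
y = [0] * 20 + [1] * 20                           # toy labels
svm = SVC(kernel="rbf").fit(X, y)                 # implicit high-dimensional mapping
print(svm.predict(X[:5]))
```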
Automatic optimization of data perturbation distributions for multi-style training in speech recognition
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846240
Mortaza Doulaty, R. Rose, O. Siohan
Abstract: Speech recognition performance using deep neural network based acoustic models is known to degrade when the acoustic environment and the speaker population in the target utterances are significantly different from the conditions represented in the training data. To address these mismatched scenarios, multi-style training (MTR) has been used to perturb utterances in an existing uncorrupted and potentially mismatched training speech corpus to better match target domain utterances. This paper addresses the problem of determining the distribution of perturbation levels for a given set of perturbation types that best matches the target speech utterances. An approach is presented that, given a small set of utterances from a target domain, automatically identifies an empirical distribution of perturbation levels that can be applied to utterances in an existing training set. Distributions are estimated for perturbation types that include acoustic background environments, reverberant room configurations, and speaker related variation like frequency and temporal warping. The end goal is for the resulting perturbed training set to characterize the variability in the target domain and thereby optimize ASR performance. An experimental study is performed to evaluate the impact of this approach on ASR performance when the target utterances are taken from a simulated far-field acoustic environment.
Citations: 9
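The paper's estimation procedure is more elaborate; this toy sketch assumes a single perturbation type (additive noise at several SNRs) and a scalar matching statistic, to show how an empirical distribution of perturbation levels can be read off a small set of target-domain utterances.

```python
# Toy estimation of an empirical perturbation-level distribution.
import numpy as np

rng = np.random.default_rng(0)
levels = np.array([0, 5, 10, 15, 20])           # candidate SNRs in dB (assumed)
# hypothetical per-level statistic of perturbed training data (e.g. mean
# log-energy), precomputed offline
level_stats = np.array([-2.0, -1.2, -0.6, -0.2, 0.0])
target_stats = rng.normal(-0.9, 0.4, size=200)  # same statistic on target utterances

# assign each target utterance to its closest perturbation level
choices = np.abs(target_stats[:, None] - level_stats[None, :]).argmin(axis=1)
hist = np.bincount(choices, minlength=len(levels)) / len(choices)
for snr, p in zip(levels, hist):
    print(f"SNR {snr:2d} dB: p = {p:.2f}")       # empirical distribution for MTR
```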
Unsupervised context learning for speech recognition
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846302
A. Michaely, M. Ghodsi, Zelin Wu, Justin Scheiner, Petar S. Aleksic
Abstract: It has been shown in the literature that automatic speech recognition systems can greatly benefit from contextual information [1, 2, 3, 4, 5]. Contextual information can be used to simplify the beam search and improve recognition accuracy. Types of useful contextual information can include the name of the application the user is in, the contents of the user's phone screen, the user's location, a certain dialog state, etc. Building a separate language model for each of these types of context is not feasible due to limited resources or limited amounts of training data. In this paper, we describe an approach for unsupervised learning of contextual information and automatic building of contextual biasing models. Our approach can be used to build a large number of small contextual models from a limited amount of available unsupervised training data. We describe how n-grams relevant for a particular context are automatically selected, as well as how the optimal size of the final contextual model is chosen. Our experimental results show substantial accuracy improvements for several types of context.
Citations: 10
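As a hedged illustration (the scoring rule here is a simple smoothed log-odds ratio, not necessarily the paper's), the sketch below selects n-grams that are over-represented in context-specific text relative to a background corpus and keeps the top-scoring ones as a small biasing model.

```python
# Toy n-gram selection for a contextual biasing model.
from collections import Counter
import math

def ngrams(tokens, n=2):
    return zip(*(tokens[i:] for i in range(n)))

# hypothetical corpora: generic queries vs. queries seen in a music context
background = "play some music please call mom set a timer".split() * 50
in_context = "play some music play my playlist play jazz music".split() * 5

bg = Counter(ngrams(background))
ctx = Counter(ngrams(in_context))
total_bg, total_ctx = sum(bg.values()), sum(ctx.values())

# score each context n-gram by its (add-one smoothed) log-odds vs. background
scored = sorted(
    ((math.log((ctx[g] / total_ctx) / ((bg[g] + 1) / total_bg)), g) for g in ctx),
    reverse=True,
)
biasing_model = [g for score, g in scored if score > 0][:10]
print(biasing_model)
```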
Deep learning with maximal figure-of-merit cost to advance multi-label speech attribute detection
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846308
Ivan Kukanov, Ville Hautamäki, S. Siniscalchi, Kehuang Li
Abstract: In this work, we are interested in boosting speech attribute detection by formulating it as a multi-label classification task, with deep neural networks (DNNs) used to design the speech attribute detectors. A straightforward way to tackle the task is to estimate the DNN parameters using the mean squared error (MSE) loss function and employ a sigmoid function in the DNN output nodes. A more principled way is to incorporate the micro-F1 measure, a widely used metric in multi-label classification, into the DNN loss function, so as to directly improve the metric of interest at training time. Micro-F1 is not differentiable, but we overcome this problem by casting our task in the maximal figure-of-merit (MFoM) learning framework. The results demonstrate that our MFoM approach consistently outperforms the baseline systems.
Citations: 8
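One common way to make the metric trainable (a smooth surrogate in the spirit of MFoM, not its exact formulation) is to replace hard decisions with sigmoid scores and minimize one minus the resulting soft micro-F1, as in this sketch.

```python
# Sketch: differentiable soft micro-F1 loss for multi-label DNN training.
import torch

def soft_micro_f1_loss(logits, targets, eps=1e-8):
    p = torch.sigmoid(logits)               # soft multi-label decisions
    tp = (p * targets).sum()                # soft true positives
    fp = (p * (1 - targets)).sum()          # soft false positives
    fn = ((1 - p) * targets).sum()          # soft false negatives
    f1 = 2 * tp / (2 * tp + fp + fn + eps)  # micro-averaged soft F1
    return 1.0 - f1                         # minimize 1 - F1

logits = torch.randn(8, 20, requires_grad=True)   # 20 hypothetical speech attributes
targets = (torch.rand(8, 20) > 0.7).float()
loss = soft_micro_f1_loss(logits, targets)
loss.backward()                                   # gradients flow through the metric
print(loss.item())
```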
An unsupervised vocabulary selection technique for Chinese automatic speech recognition
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846298
Yike Zhang, Pengyuan Zhang, Ta Li, Yonghong Yan
Abstract: The vocabulary is a vital component of automatic speech recognition (ASR) systems. For a specific Chinese speech recognition task, using a large general vocabulary not only leads to much longer decoding times, but also hurts recognition accuracy. In this paper, we propose an unsupervised algorithm to select task-specific words from a large general vocabulary. The out-of-vocabulary (OOV) rate is a standard measure of vocabulary quality and is related to recognition accuracy. However, it is hard to compute the OOV rate for a Chinese vocabulary, since OOVs are often segmented into single Chinese characters and most Chinese vocabularies contain all single Chinese characters. To deal with this problem, we propose a novel method to estimate the OOV rate of Chinese vocabularies. In experiments, we found that our estimated OOV rate correlates with the character error rate (CER) of recognition. Compared to the general vocabulary and a frequency-based vocabulary selection method, our proposed method yields both the lowest OOV rate and the lowest CER on two Chinese conversational telephone speech (CTS) evaluation sets. In addition, it significantly reduces the size of the language model (LM) and the corresponding weighted finite-state transducer (WFST) network, leading to more efficient decoding.
Citations: 2
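The paper's OOV-rate estimator is its own contribution; as a stand-in, this toy sketch segments text by greedy maximum matching against a candidate vocabulary and treats single-character fallbacks as an OOV proxy, since true OOVs cannot be counted directly for a Chinese vocabulary.

```python
# Toy OOV proxy for a Chinese vocabulary via greedy maximum matching.
def max_match(text, vocab, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in vocab or l == 1:   # fall back to one character
                tokens.append(text[i:i + l])
                i += l
                break
    return tokens

vocab = {"语音", "识别", "语音识别", "系统"}           # hypothetical candidate vocabulary
tokens = max_match("语音识别系统测试", vocab)
fallbacks = [t for t in tokens if len(t) == 1 and t not in vocab]
print(tokens, f"OOV proxy: {len(fallbacks) / len(tokens):.2f}")
```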
Semantic model for fast tagging of word lattices
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846295
L. Velikovich
Abstract: This paper introduces a semantic tagger that inserts tags into a word lattice, such as one produced by a real-time large-vocabulary speech recognition system. Benefits of such a tagger include the ability to rescore speech recognition hypotheses based on this metadata, as well as providing rich annotations to downstream clients. We focus on the domain of spoken search queries and voice commands, which can be useful for building an intelligent assistant. We explore a method to distill a pre-existing very large named entity disambiguation (NED) model into a lightweight tagger. This is accomplished by constructing a joint distribution of tagged n-grams from a supervised training corpus, then deriving a conditional distribution for a given lattice. With 300 tagging categories, the tagger achieves a precision of 88.2% and a recall of 93.1% on 1-best paths in speech recognition lattices, with 2.8 ms median latency.
Citations: 7
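A toy sketch of the tagging step (hypothetical n-gram table, applied to a 1-best word sequence rather than a full lattice): tags are inserted by longest-first matching against a distilled tagged-n-gram table.

```python
# Toy tag insertion from a distilled tagged-n-gram table (hypothetical entries).
tagged_ngrams = {
    ("play", "taylor", "swift"): ("artist", 1, 3),  # tag, span start, span end
    ("call", "mom"): ("contact", 1, 2),
}

def insert_tags(words):
    out, i = [], 0
    while i < len(words):
        for n in (3, 2):                        # longest match first
            entry = tagged_ngrams.get(tuple(words[i:i + n]))
            if entry is not None:
                name, s, e = entry
                out += (words[i:i + s] + [f"<{name}>"]
                        + words[i + s:i + e] + [f"</{name}>"] + words[i + e:i + n])
                i += n
                break
        else:                                   # no n-gram matched here
            out.append(words[i])
            i += 1
    return out

print(insert_tags("please play taylor swift now".split()))
# ['please', 'play', '<artist>', 'taylor', 'swift', '</artist>', 'now']
```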
Environmentally robust audio-visual speaker identification
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846282
Lea Schonherr, Dennis Orth, M. Heckmann, D. Kolossa
Abstract: To improve the accuracy of audio-visual speaker identification, we propose a new approach that achieves an optimal combination of the different modalities at the score level. We use the i-vector method for acoustic speaker recognition and local binary patterns (LBP) for visual speaker recognition. For the input data of both modalities, multiple confidence measures are utilized to calculate an optimal weight for the fusion. Oracle weights are chosen so as to maximize the difference between the score of the genuine speaker and the best competing score, and a mapping function for weight estimation is then learned from these oracle weights. To test the approach, various combinations of noise levels for the acoustic and visual data are considered. We show that the weighted multimodal identification is far less influenced by noise or distortions in the acoustic or visual observations than an unweighted combination.
Citations: 9
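To make the oracle-weight idea concrete, this toy sketch (random scores, one trial) grid-searches the fusion weight that maximizes the margin between the genuine speaker's fused score and the best competing fused score.

```python
# Toy oracle fusion weight for one audio-visual identification trial.
import numpy as np

rng = np.random.default_rng(0)
audio = rng.normal(size=10)   # acoustic scores for 10 enrolled speakers
video = rng.normal(size=10)   # visual scores for the same speakers
genuine = 3                   # index of the true speaker

def margin(w):
    fused = w * audio + (1 - w) * video          # score-level fusion
    rivals = np.delete(fused, genuine)
    return fused[genuine] - rivals.max()         # gap to best competitor

weights = np.linspace(0, 1, 101)
oracle_w = weights[np.argmax([margin(w) for w in weights])]
print(f"oracle weight: {oracle_w:.2f}, margin: {margin(oracle_w):.3f}")
```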