2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Prediction of Dialogue Success with Spectral and Rhythm Acoustic Features Using DNNs and SVMs
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639580
Athanasios Lykartsis, M. Kotti, A. Papangelis, Y. Stylianou
Abstract: In this paper we investigate the novel use of exclusively audio to predict whether a spoken dialogue will be successful or not, both in a subjective and in an objective manner. To achieve that, multiple spectral and rhythmic features are input to support vector machines and deep neural networks. We report results on data from 3267 spoken dialogues, using both the full user response as well as parts of it. Experiments show that an average accuracy of 74% can be achieved using just 5 acoustic features when analysing merely 1 user turn, which allows a real-time and fairly accurate prediction of a dialogue's success after only one short interaction unit. Of the features tested, those related to speech rate, signal energy and cepstrum are amongst the most informative. The results presented here outperform the state of the art in spoken dialogue success prediction through solely acoustic features.
Citations: 5
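The abstract above names speech rate and signal energy among the most informative feature families. As a hedged sketch (not the paper's actual pipeline), frame-level log-energy and a zero-crossing-rate proxy for speech rate could be extracted from a single user turn like this; the frame length and feature choice are illustrative:

```python
import math

# Illustrative feature extraction for one audio turn: per-frame log-energy
# and zero-crossing rate (a crude speech-rate proxy). Frame length and the
# specific features are our assumptions, not the paper's exact setup.

def turn_features(samples, frame=160):
    """Per-frame (log-energy, zero-crossing rate) pairs for one turn."""
    feats = []
    for i in range(0, len(samples) - frame + 1, frame):
        chunk = samples[i:i + frame]
        log_energy = math.log(sum(s * s for s in chunk) + 1e-10)
        zcr = sum(1 for a, b in zip(chunk, chunk[1:]) if a * b < 0) / frame
        feats.append((log_energy, zcr))
    return feats
```

Such per-frame pairs would then be summarised over the turn and fed to an SVM or DNN classifier.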
Domain Adaptation of End-to-end Speech Recognition in Low-Resource Settings
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639506
Lahiru Samarakoon, B. Mak, Albert Y. S. Lam
Abstract: End-to-end automatic speech recognition (ASR) has simplified the traditional ASR system-building pipeline by eliminating the need for multiple components and for expert linguistic knowledge to create pronunciation dictionaries. End-to-end ASR therefore fits well when building systems for new domains. However, one major drawback of end-to-end ASR is that it requires a larger amount of labeled speech than traditional methods. In this paper, we explore domain adaptation approaches for end-to-end ASR in low-resource settings. We show that joint domain identification and speech recognition (by inserting a symbol for the domain at the beginning of the label sequence), factorized hidden layer adaptation, and a domain-specific gating mechanism improve performance for a low-resource target domain. Furthermore, we show the robustness of the proposed adaptation methods to an unseen domain when only 3 hours of untranscribed data are available, with relative improvements of up to 8.7%.
Citations: 14
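The joint domain-identification idea from the abstract is simple to illustrate: a domain symbol is inserted at the beginning of each label sequence so the end-to-end model learns to predict the domain before transcribing. A minimal sketch, where the `<dom:...>` token format is our own illustration rather than the paper's:

```python
# Sketch of joint domain identification and recognition: prepend a
# domain marker token to the character label sequence used in training.
# The "<dom:...>" token naming is an assumption for illustration.

def add_domain_token(labels, domain):
    """Prepend a domain marker to a character label sequence."""
    return ["<dom:%s>" % domain] + list(labels)
```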
Teacher-Student Training for Text-Independent Speaker Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639564
Raymond W. M. Ng, Xuechen Liu, P. Swietojanski
Abstract: This paper investigates text-independent speaker recognition using neural embedding extractors based on the time-delay neural network. Our primary focus is to explore the teacher-student (TS) training framework for knowledge distillation in a text-independent (TI) speaker recognition task. We report results on both proprietary and public benchmarks, obtaining competitive results with 88-93% smaller models. In particular, in clean testing conditions, TS training on neural-based TI systems achieved the same or better performance than the i-vector based counterparts. Neural embeddings are less prone to short-segment issues and offer better performance, particularly in the high-recall setting. They can also provide additional insights about speakers, such as gender or how difficult a given speaker is to recognize.
Citations: 9
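The core of teacher-student knowledge distillation, as used in this and several later entries, is training the student to minimise the KL divergence between teacher and student posteriors. A minimal sketch of the per-frame loss only; the embedding networks themselves are out of scope here:

```python
import math

# Sketch of the frame-level teacher-student objective: the student is
# trained to minimise KL(teacher || student) over output posteriors.
# Only the loss computation is shown; the networks are not modelled.

def kl_divergence(teacher_post, student_post, eps=1e-12):
    """KL divergence between two posterior distributions for one frame."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_post, student_post)
        if t > 0.0
    )
```

When the student matches the teacher exactly, the loss is zero; any mismatch yields a positive penalty that backpropagation would reduce.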
Out-of-Domain Slot Value Detection for Spoken Dialogue Systems with Context Information
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639671
Yuka Kobayashi, Takami Yoshida, K. Iwata, Hiroshi Fujimura, M. Akamine
Abstract: This paper proposes a context-based approach to detecting out-of-domain slot values in user utterances in spoken dialogue systems. The approach detects keywords of slot values in utterances and consults domain knowledge (i.e., an ontology) to check whether the keywords are out-of-domain, which can prevent the systems from responding improperly to user requests. We use a Recurrent Neural Network (RNN) encoder-decoder model and propose a method that uses only in-domain data. The method replaces the word embedding vectors of the keywords corresponding to slot values with random vectors during training of the model. This allows the model to use context information, and it is robust against over-fitting because it is independent of the slot values in the training data. Experiments show that the proposed method achieves a 65% relative gain in F1 score over a baseline model and a further 13 percentage points when combined with other methods.
Citations: 3
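The training trick described above (randomising the embeddings of slot-value keywords so the encoder-decoder must rely on context) can be sketched as follows; the vocabulary, slot list, and dimensionality are illustrative assumptions:

```python
import random

# Sketch of the embedding-replacement trick: embedding vectors of
# slot-value keywords are swapped for random vectors during training,
# forcing the model to learn from context rather than the value itself.
# All names and sizes here are illustrative.

def mask_slot_embeddings(tokens, embeddings, slot_values, dim=4, seed=0):
    """Look up embeddings, randomising those of slot-value keywords."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in slot_values:
            out.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
        else:
            out.append(list(embeddings[tok]))
    return out
```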
Ranking Approach to Compact Text Representation for Personal Digital Assistants
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639542
Issac Alphonso, Nick Kibre, T. Anastasakos
Abstract: Personal digital assistants must display the output from the speech recognizer in a compact and readable representation. The process of transforming sequences of spoken words into written text is called inverse text normalization (ITN). In this paper, we present a ranking-based approach to ITN that incorporates predictive information from various neural-net LSTM and n-gram models to select the best written text to display. Our approach ranks the written-text candidates, generated by applying weighted FSTs to the spoken words, using a gradient boosted decision tree (GBDT) ensemble. The ranker achieves an 18.48% relative reduction in word error rate over an unweighted FST system. Further, our two-stage approach allows us to decouple speech recognition from ITN and gives us greater flexibility in system configuration, since the written form can vary by domain.
Citations: 2
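The second stage of the approach above can be sketched generically: candidate written forms (produced elsewhere, by weighted FSTs in the paper) are re-ranked by a learned scorer. A linear scorer stands in for the GBDT ensemble here, and the feature function and weights are illustrative:

```python
# Sketch of ITN candidate re-ranking. A simple linear scorer replaces the
# paper's gradient boosted decision tree ensemble; featurize() and the
# weights are assumptions for illustration.

def rank_candidates(candidates, featurize, weights):
    """Return candidates sorted best-first by a linear feature score."""
    def score(cand):
        return sum(w * f for w, f in zip(weights, featurize(cand)))
    return sorted(candidates, key=score, reverse=True)
```

With a single length feature weighted negatively, the ranker prefers the more compact written form, which matches the compactness goal stated in the abstract.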
Posterior Calibration for Multi-Class Paralinguistic Classification
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639628
G. Gosztolya, R. Busa-Fekete
Abstract: Computational paralinguistics is an area that contains diverse classification tasks. In many cases the class distribution of these tasks is highly imbalanced by nature, as the phenomena to be detected in human speech do not occur uniformly. To account for this imbalance, it is common in this area to measure the efficiency of classification approaches via the Unweighted Average Recall (UAR) metric. However, general classification methods such as Support Vector Machines (SVMs) and Deep Neural Networks (DNNs) have been shown to focus on traditional classification accuracy, which might lead to suboptimal performance on imbalanced datasets. In this study we show that posterior calibration can counter this effect and improve the UAR scores obtained. Our approach led to relative error reductions of 4% and 14% on the test sets of two multi-class paralinguistic datasets with imbalanced class distributions, outperforming traditional downsampling.
Citations: 0
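Posterior calibration of this kind can be sketched as rescaling class posteriors by per-class coefficients before the argmax, which lets a rare class win more often and can raise UAR. In practice the coefficients are tuned on a development set; the values below are illustrative:

```python
# Sketch of posterior calibration for imbalanced classes: posteriors are
# rescaled by per-class coefficients before the argmax. The coefficient
# values are assumptions; real systems tune them on held-out data.

def calibrated_argmax(posteriors, class_weights):
    """Pick the class after rescaling posteriors by per-class weights."""
    scaled = [p * w for p, w in zip(posteriors, class_weights)]
    return max(range(len(scaled)), key=scaled.__getitem__)
```

For example, a minority class whose raw posterior loses (0.4 vs. 0.6) can be selected once its weight reflects its rarity.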
A Teacher-Student Learning Approach for Unsupervised Domain Adaptation of Sequence-Trained ASR Models
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639635
Vimal Manohar, Pegah Ghahremani, Daniel Povey, S. Khudanpur
Abstract: Teacher-student (T-S) learning is a transfer learning approach, where a teacher network is used to "teach" a student network to make the same predictions as the teacher. Originally formulated for model compression, this approach has also been used for domain adaptation, and is particularly effective when parallel data is available in the source and target domains. The standard approach uses a frame-level objective of minimizing the KL divergence between the frame-level posteriors of the teacher and student networks. However, for sequence-trained models for speech recognition, it is more appropriate to train the student to mimic the sequence-level posterior of the teacher network. In this work, we compare this sequence-level KL divergence objective with another semi-supervised sequence-training method, namely lattice-free MMI, for unsupervised domain adaptation. We investigate the approaches in multiple scenarios, including adapting from clean to noisy speech, bandwidth mismatch, and channel mismatch.
Citations: 44
DenseNet BLSTM for Acoustic Modeling in Robust ASR
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639529
Maximilian Strake, Pascal Behr, Timo Lohrenz, T. Fingscheidt
Abstract: In recent years, robust automatic speech recognition (ASR) has benefited greatly from the use of neural networks for acoustic modeling, although performance still degrades in severe noise conditions. Building on the previous success of models that use convolutional layers followed by bidirectional long short-term memory (BLSTM) layers in the same network, we propose to use a densely connected convolutional network (DenseNet) as the first part of such a model, with a BLSTM network as the second. A particular contribution of our work is that we modify the DenseNet topology to act as a feature extractor for the subsequent BLSTM network operating on whole speech utterances. We evaluate our model on the 6-channel task of CHiME-4 and consistently outperform a top-performing baseline based on wide residual networks and BLSTMs, providing a 2.4% relative WER reduction on the real test set.
Citations: 5
Combining De-noising Auto-encoder and Recurrent Neural Networks in End-to-End Automatic Speech Recognition for Noise Robustness
2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI: 10.1109/SLT.2018.8639597
Tzu-Hsuan Ting, Chia-Ping Chen
Abstract: In this paper, we propose an end-to-end noise-robust automatic speech recognition system built from deep-learning implementations of de-noising auto-encoders and recurrent neural networks. We use batch normalization and a novel design for the front-end de-noising auto-encoder, which mimics a two-stage prediction of a single clean feature-vector frame from multi-frame noisy feature vectors. For the back-end word recognition, we use an end-to-end system based on a bidirectional recurrent neural network with long short-term memory cells (LSTM-BiRNN), trained via the connectionist temporal classification (CTC) criterion. Its performance is compared to a baseline back-end based on hidden Markov models and Gaussian mixture models (HMM-GMM). Our experimental results show that the proposed front-end de-noising auto-encoder outperforms the best record we could find for the Aurora 2.0 clean-condition training tasks by an absolute improvement of 1.2% (6.0% vs. 7.2%), and the proposed end-to-end back-end architecture is as good as the traditional HMM-GMM back-end recognizer.
Citations: 1
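The front-end described above predicts a single clean frame from multiple noisy frames. Only the input arrangement (context stacking of neighbouring frames, with edge padding) is sketched here; the context width is an illustrative assumption, and the auto-encoder itself is not modelled:

```python
# Sketch of multi-frame input construction for the de-noising front-end:
# concatenate a window of noisy frames around a target index, repeating
# the first/last frame at the utterance edges. Context width is assumed.

def stack_context(frames, idx, context=2):
    """Concatenate frames idx-context..idx+context, padding at the edges."""
    padded = [frames[0]] * context + list(frames) + [frames[-1]] * context
    window = padded[idx: idx + 2 * context + 1]
    return [x for frame in window for x in frame]
```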