Prediction of Dialogue Success with Spectral and Rhythm Acoustic Features Using DNNs and SVMs
Athanasios Lykartsis, M. Kotti, A. Papangelis, Y. Stylianou
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639580

Abstract: In this paper we investigate the novel use of audio alone to predict whether a spoken dialogue will be successful, in both a subjective and an objective sense. To achieve this, multiple spectral and rhythmic features are fed to support vector machines and deep neural networks. We report results on data from 3267 spoken dialogues, using both the full user response and parts of it. Experiments show that an average accuracy of 74% can be achieved with just 5 acoustic features when analysing only one user turn, enabling a real-time and fairly accurate prediction of dialogue success after a single short interaction unit. Of the features tested, those related to speech rate, signal energy, and cepstrum are among the most informative. The results presented here outperform the state of the art in spoken dialogue success prediction from acoustic features alone.
{"title":"Domain Adaptation of End-to-end Speech Recognition in Low-Resource Settings","authors":"Lahiru Samarakoon, B. Mak, Albert Y. S. Lam","doi":"10.1109/SLT.2018.8639506","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639506","url":null,"abstract":"End-to-end automatic speech recognition (ASR) has simplified the traditional ASR system building pipeline by eliminating the need to have multiple components and also the requirement for expert linguistic knowledge for creating pronunciation dictionaries. Therefore, end-to-end ASR fits well when building systems for new domains. However, one major drawback of end-to-end ASR is that, it is necessary to have a larger amount of labeled speech in comparison to traditional methods. Therefore, in this paper, we explore domain adaptation approaches for end-to-end ASR in low-resource settings. We show that joint domain identification and speech recognition by inserting a symbol for domain at the beginning of the label sequence, factorized hidden layer adaptation and a domain-specific gating mechanism improve the performance for a low-resource target domain. Furthermore, we also show the robustness of proposed adaptation methods to an unseen domain, when only 3 hours of untranscribed data is available with improvements reporting upto 8.7% relative.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114948087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teacher-Student Training for Text-Independent Speaker Recognition","authors":"Raymond W. M. Ng, Xuechen Liu, P. Swietojanski","doi":"10.1109/SLT.2018.8639564","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639564","url":null,"abstract":"This paper investigates text-independent speaker recognition using neural embedding extractors based on the time-delay neural network. Our primary focus is to explore the teacher-student (TS) training framework for knowledge distillation in a text-independent (TI) speaker recognition task. We report the results on both proprietary and public benchmarks, obtaining competitive results with 88–93% smaller models. Particularly, in clean testing conditions, we find TS training on neural-based TI systems achieved same or better performance than the i-vector based counterparts. Neural embeddings are less prone to short segment issues, and offer better performance particularly in the high-recall setting. They can also provide some additional insights about speakers, such as gender or how difficult a given speaker can be for recognition.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116501269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Out-of-Domain Slot Value Detection for Spoken Dialogue Systems with Context Information
Yuka Kobayashi, Takami Yoshida, K. Iwata, Hiroshi Fujimura, M. Akamine
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639671

Abstract: This paper proposes a context-based approach to detecting out-of-domain slot values in user utterances in spoken dialogue systems. The approach detects keywords of slot values in utterances and consults domain knowledge (i.e., an ontology) to check whether the keywords are out-of-domain, which prevents the systems from responding improperly to user requests. We use a Recurrent Neural Network (RNN) encoder-decoder model and propose a method that uses only in-domain data: during training, the word embedding vectors of the keywords corresponding to slot values are replaced with random vectors, which forces the model to exploit context information. The model is robust against over-fitting because it is independent of the slot values in the training data. Experiments show that the proposed method achieves a 65% relative gain in F1 score over a baseline model, and a further 13 percentage points when combined with other methods.
{"title":"Ranking Approach to Compact Text Representation for Personal Digital Assistants","authors":"Issac Alphonso, Nick Kibre, T. Anastasakos","doi":"10.1109/SLT.2018.8639542","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639542","url":null,"abstract":"Personal digital assistants must display the output from the speech recognizer in a compact and readable representation. The process of transforming sequences from spoken words to written text is called inverse text normalization (ITN). In this paper, we present a ranking based approach to ITN that incorporates predicative information from various neural-net LSTM and n-gram models to select the best written text to display. Our approach ranks the written text candidates, generated by applying weighted FSTs to the spoken words, using a gradient boosted decision tree ensemble (GBDT). The ranker achieves a 18.48% relative reduction in word error rate over an unweighted FST system. Further, our two-stage approach allows us to decouple speech recognition from ITN and gives us greater flexibility in system configuration, since the written-form can vary by domain.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128573078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Posterior Calibration for Multi-Class Paralinguistic Classification","authors":"G. Gosztolya, R. Busa-Fekete","doi":"10.1109/SLT.2018.8639628","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639628","url":null,"abstract":"Computational paralinguistics is an area which contains diverse classification tasks. In many cases the class distribution of these tasks is highly imbalanced by nature, as the phenomena needed to detect in human speech do not occur uniformly. To ignore this imbalance, it is common to measure the efficiency of classification approaches via the Unweighted Average Recall (UAR) metric in this area. However, general classification methods such as Support-Vector Machines (SVM) and Deep Neural Networks (DNNs) were shown to focus on traditional classification accuracy, which might lead to a suboptimal performance for imbalanced datasets. In this study we show that by performing posterior calibration, this effect can be countered and the UAR scores obtained might be improved. Our approach led to relative error reduction values of 4% and 14% on the test set of two multi-class paralinguistic datasets that had imbalanced class distributions, outperforming the traditional downsampling.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133173178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Teacher-Student Learning Approach for Unsupervised Domain Adaptation of Sequence-Trained ASR Models
Vimal Manohar, Pegah Ghahremani, Daniel Povey, S. Khudanpur
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639635

Abstract: Teacher-student (T-S) learning is a transfer learning approach in which a teacher network is used to “teach” a student network to make the same predictions as the teacher. Originally formulated for model compression, this approach has also been used for domain adaptation, and is particularly effective when parallel data is available in the source and target domains. The standard approach uses a frame-level objective of minimizing the KL divergence between the frame-level posteriors of the teacher and student networks. However, for sequence-trained models for speech recognition, it is more appropriate to train the student to mimic the sequence-level posterior of the teacher network. In this work, we compare this sequence-level KL divergence objective with another semi-supervised sequence-training method, namely lattice-free MMI, for unsupervised domain adaptation. We investigate the approaches in multiple scenarios, including adaptation from clean to noisy speech, bandwidth mismatch, and channel mismatch.
DenseNet BLSTM for Acoustic Modeling in Robust ASR
Maximilian Strake, Pascal Behr, Timo Lohrenz, T. Fingscheidt
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639529

Abstract: In recent years, robust automatic speech recognition (ASR) has benefited greatly from the use of neural networks for acoustic modeling, although performance still degrades in severe noise conditions. Building on the previous success of models that combine convolutional layers with subsequent bidirectional long short-term memory (BLSTM) layers in the same network, we propose to use a densely connected convolutional network (DenseNet) as the first part of such a model, with a BLSTM network as the second. A particular contribution of our work is that we modify the DenseNet topology to act as a feature extractor for the subsequent BLSTM network operating on whole speech utterances. We evaluate our model on the 6-channel task of CHiME-4 and consistently outperform a top-performing baseline based on wide residual networks and BLSTMs, obtaining a 2.4% relative WER reduction on the real test set.
{"title":"Combining De-noising Auto-encoder and Recurrent Neural Networks in End-to-End Automatic Speech Recognition for Noise Robustness","authors":"Tzu-Hsuan Ting, Chia-Ping Chen","doi":"10.1109/SLT.2018.8639597","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639597","url":null,"abstract":"In this paper, we propose an end-to-end noise-robust automatic speech recognition system through deep-learning implementation of de-noising auto-encoders and recurrent neural networks. We use batch normalization and a novel design for the front-end de-noising auto-encoder, which mimics a two-stage prediction of a single-frame clean feature vector from multi-frame noisy feature vectors. For the backend word recognition, we use an end-to-end system based on bidirectional recurrent neural network with long short-term memory cells. The LSTM-BiRNN is trained via connectionist temporal classification criterion. Its performance is compared to a baseline backend based on hidden Markov models and Gaussian mixture models (HMM-GMM). Our experimental results show that the proposed novel front-end de-noising auto-encoder outperforms the best record we can find for the Aurora 2.0 clean-condition training tasks by an absolute improvement of 1.2% (6.0% vs. 7.2%). In addition, the proposed end-to-end back-end architecture is as good as the traditional HMM-GMM back-end recognizer.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127773992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}