A Teacher-Student Learning Approach for Unsupervised Domain Adaptation of Sequence-Trained ASR Models
Vimal Manohar, Pegah Ghahremani, Daniel Povey, S. Khudanpur
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639635
Abstract: Teacher-student (T-S) learning is a transfer learning approach, where a teacher network is used to “teach” a student network to make the same predictions as the teacher. Originally formulated for model compression, this approach has also been used for domain adaptation, and is particularly effective when parallel data is available in source and target domains. The standard approach uses a frame-level objective of minimizing the KL divergence between the frame-level posteriors of the teacher and student networks. However, for sequence-trained models for speech recognition, it is more appropriate to train the student to mimic the sequence-level posterior of the teacher network. In this work, we compare this sequence-level KL divergence objective with another semi-supervised sequence-training method, namely the lattice-free MMI, for unsupervised domain adaptation. We investigate the approaches in multiple scenarios including adapting from clean to noisy speech, bandwidth mismatch and channel mismatch.

DenseNet BLSTM for Acoustic Modeling in Robust ASR
Maximilian Strake, Pascal Behr, Timo Lohrenz, T. Fingscheidt
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639529
Abstract: In recent years, robust automatic speech recognition (ASR) has benefited greatly from the use of neural networks for acoustic modeling, although performance still degrades in severe noise conditions. Based on the previous success of models using convolutional and subsequent bidirectional long short-term memory (BLSTM) layers in the same network, we propose to use a densely connected convolutional network (DenseNet) as the first part of such a model, while the second part is a BLSTM network. A particular contribution of our work is that we modify the DenseNet topology to become a kind of feature extractor for the subsequent BLSTM network operating on whole speech utterances. We evaluate our model on the 6-channel task of CHiME-4 and consistently outperform a top-performing baseline based on wide residual networks and BLSTMs, providing a 2.4% relative WER reduction on the real test set.

{"title":"Combining De-noising Auto-encoder and Recurrent Neural Networks in End-to-End Automatic Speech Recognition for Noise Robustness","authors":"Tzu-Hsuan Ting, Chia-Ping Chen","doi":"10.1109/SLT.2018.8639597","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639597","url":null,"abstract":"In this paper, we propose an end-to-end noise-robust automatic speech recognition system through deep-learning implementation of de-noising auto-encoders and recurrent neural networks. We use batch normalization and a novel design for the front-end de-noising auto-encoder, which mimics a two-stage prediction of a single-frame clean feature vector from multi-frame noisy feature vectors. For the backend word recognition, we use an end-to-end system based on bidirectional recurrent neural network with long short-term memory cells. The LSTM-BiRNN is trained via connectionist temporal classification criterion. Its performance is compared to a baseline backend based on hidden Markov models and Gaussian mixture models (HMM-GMM). Our experimental results show that the proposed novel front-end de-noising auto-encoder outperforms the best record we can find for the Aurora 2.0 clean-condition training tasks by an absolute improvement of 1.2% (6.0% vs. 7.2%). In addition, the proposed end-to-end back-end architecture is as good as the traditional HMM-GMM back-end recognizer.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127773992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating on-device ASR on Field Recordings from an Interactive Reading Companion
Anastassia Loukina, Nitin Madnani, Beata Beigman Klebanov, A. Misra, Georgi Angelov, O. Todic
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639603
Abstract: Many applications designed to assess and improve oral reading fluency use automated speech recognition (ASR) to provide feedback to students, teachers, and parents. Most such applications rely on a distributed architecture with the speech recognition component located in the cloud. For interactive applications, this approach requires a reliable Internet connection that may not always be available. We investigate whether on-device ASR can be used for a virtual reading companion using recordings obtained from children both in a controlled environment and in the field. Our limited evaluation makes us cautiously optimistic about the feasibility of using on-device ASR for our application.

An Icelandic Pronunciation Dictionary for TTS
Anna Björk Nikulásdóttir, Jón Guðnason, Eiríkur Rögnvaldsson
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639590
Abstract: This paper describes an Icelandic pronunciation dictionary for speech applications and its processing for use in a text-to-speech system for Icelandic. Cleaning and correction procedures were implemented to create a consistent training set for grapheme-to-phoneme (g2p) conversion modeling, needed for the automatic extension of the dictionary. Experiments using the original version of the dictionary and the cleaned version described in this paper as training sets for a joint-sequence g2p algorithm show a clear benefit of using clean training data, both in terms of phone error rate (PER) and in the categories of errors made by the g2p algorithm. The results of the dictionary processing were also used to create an initial version of an open-source database for Icelandic speech applications.

{"title":"Toward Multi-Features Emphasis Speech Translation: Assessment of Human Emphasis Production and Perception with Speech and Text Clues","authors":"Quoc Truong Do, S. Sakti, Satoshi Nakamura","doi":"10.1109/SLT.2018.8639641","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639641","url":null,"abstract":"Emphasis is an important factor of human speech that helps convey emotion and the focused information of utterances. Recently, studies have been conducted on speech-to-speech translation to preserve the emphasis information from the source language to the target language. However, since different cultures have various ways of expressing emphasis, just considering the acoustic-to-acoustic feature emphasis translation may not always reflect the experiences of users. On the other hand, emphasis can be expressed at various levels in both text and speech. However, it remains unclear how we communicate emphasis in a different form (acoustic/linguistic) with different levels and whether we can perceive the difference between different levels of emphasis or observe the similarity of the same emphasis levels in both text and speech forms. In this paper, we conducted analyses on human perception of emphasis with both speech and text clues through crowd-sourced evaluations. The results indicate that although participants can distinguish among emphasis levels and perceive the same emphasis level between speech and text, many ambiguities still exist at certain emphasis levels. Thus, our result provides insight into what needs to be handled during the emphasis translation process.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117163567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Evaluation of Deep Spectral Mappings and WaveNet Vocoder for Voice Conversion
Patrick Lumban Tobing, Tomoki Hayashi, Yi-Chiao Wu, Kazuhiro Kobayashi, T. Toda
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639608
Abstract: This paper presents an evaluation of deep spectral mappings and a WaveNet vocoder for voice conversion (VC). In our VC framework, spectral features of an input speaker are converted into those of a target speaker using a deep spectral mapping, and the converted waveform is then generated by the WaveNet vocoder from the converted spectral features together with the excitation features. We compare three deep spectral mapping networks: a deep single density network (DSDN), a deep mixture density network (DMDN), and a long short-term memory recurrent neural network with an autoregressive output layer (LSTM-AR). Moreover, we investigate several methods for reducing the mismatch in the spectral features used by the WaveNet vocoder between training and conversion, including methods that alleviate oversmoothing of the converted spectral features and a method that refines WaveNet using the converted spectral features. The experimental results demonstrate that the LSTM-AR yields slightly better spectral mapping accuracy than the other two networks, and that the proposed WaveNet refinement method significantly improves the naturalness of the converted waveform.

{"title":"High-Degree Feature for Deep Neural Network Based Acoustic Model","authors":"Hoon Chung, Sung Joo Lee, J. Park","doi":"10.1109/SLT.2018.8639524","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639524","url":null,"abstract":"In this paper, we propose to use high-degree features to improve the discrimination performance of Deep Neural Network (DNN) based acoustic model. Thanks to the successful posterior probability estimation of DNNs for high-dimensional features, high-dimensional acoustic features are commonly considered in DNN-based acoustic models.Even though it is not clear how DNN-based acoustic models estimate the posterior probability robustly, the use of high-dimensional features is based on a theorem that it helps separability of patters. There is another well-known knowledge that high-degree features increase linear separability of nonlinear input features. However, there is little work to exploit high-degree features explicitly in a DNN-based acoustic model. Therefore, in this work, we investigate high-degree features to improve the performance further.In this work, the proposed approach was evaluated on a Wall Street Journal (WSJ) speech recognition domain. The proposed method achieved up to 21.8% error reduction rate for the Eval92 test set by reducing the word error rate from 4.82% to 3.77% when using degree-2 polynomial expansion.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"212 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132654270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multichannel ASR with Knowledge Distillation and Generalized Cross Correlation Feature","authors":"Wenjie Li, Yu Zhang, Pengyuan Zhang, Fengpei Ge","doi":"10.1109/SLT.2018.8639600","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639600","url":null,"abstract":"Multi-channel signal processing techniques have played an important role in the far-field automatic speech recognition (ASR) as the separate front-end enhancement part. However, they often meet the mismatch problem. In this paper, we proposed a novel architecture of acoustic model, in which the multi-channel speech without preprocessing was utilized directly. Besides the strategy of knowledge distillation and the generalized cross correlation (GCC) adaptation were employed. We use knowledge distillation to transfer knowledge from a well-trained close-talking model to distant-talking scenarios in every frame of the multichannel distant speech. Moreover, the GCC between microphones, which contains the spatial information, is supplied as an auxiliary input to the neural network. We observe good compensation of those two techniques. Evaluated with the AMI and ICSI meeting corpora, the proposed methods achieve relative WER improvement of 7.7% and 7.5% over the model trained directly on the concatenated multi-channel speech.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131415925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A K-Nearest Neighbours Approach to Unsupervised Spoken Term Discovery
Alexis Thual, Corentin Dancette, Julien Karadayi, Juan Benjumea, Emmanuel Dupoux
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639515
Abstract: Unsupervised spoken term discovery is the task of finding recurrent acoustic patterns in speech without any annotations. Current approaches consist of two steps: (1) discovering similar patterns in speech, and (2) partitioning those pairs of acoustic tokens using graph clustering methods. We propose a new approach for the first step. Previous systems used various approximation algorithms to make the search tractable on large amounts of data. Our approach is based on an optimized k-nearest neighbours (KNN) search coupled with a fixed word embedding algorithm. The results show that the KNN algorithm is robust across languages, consistently outperforms the DTW-based baseline, and is competitive with current state-of-the-art spoken term discovery systems.
