Hossein Hadian, Daniel Povey, H. Sameti, J. Trmal, S. Khudanpur
{"title":"Improving LF-MMI Using Unconstrained Supervisions for ASR","authors":"Hossein Hadian, Daniel Povey, H. Sameti, J. Trmal, S. Khudanpur","doi":"10.1109/SLT.2018.8639684","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639684","url":null,"abstract":"We present our work on improving the numerator graph for discriminative training using the lattice-free maximum mutual information (MMI) criterion. Specifically, we propose a scheme for creating unconstrained numerator graphs by removing time constraints from the baseline numerator graphs. This leads to much smaller graphs and therefore faster preparation of training supervisions. By testing the proposed unconstrained supervisions using factorized time-delay neural network (TDNN) models, we observe 0.5% to 2.6% relative improvement over state-of-the-art word error rates on various large-vocabulary speech recognition databases.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129669526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zvi Kons, Slava Shechtman, A. Sorin, R. Hoory, Carmel Rabinovitz, E. Morais
{"title":"Neural TTS Voice Conversion","authors":"Zvi Kons, Slava Shechtman, A. Sorin, R. Hoory, Carmel Rabinovitz, E. Morais","doi":"10.1109/SLT.2018.8639550","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639550","url":null,"abstract":"Recently, speaker adaptation of neural TTS models has received significant interest, and several studies focusing on this topic have been published. All of them explore adaptation of an initial multi-speaker model trained on a corpus containing from tens to hundreds of individual speaker voices. In this work we focus on the challenging task of TTS voice conversion, where an initial system is trained on single-speaker data and then needs to be adapted to a variety of external speaker voices. The TTS voice conversion setup represents a very important use case: transcribed multi-speaker datasets might be unavailable for many languages, while any TTS technology provider is expected to have at least one suitable single-speaker dataset per supported language. We present a neural TTS system comprising separate prosody generator and synthesizer DNN models. The system is trained on a high-quality proprietary male speaker dataset. We show that the system models can be converted to a variety of external male and female ordinary voices, as well as an extremely expressive artist’s voice, and we present crowd-based subjective evaluation results.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127577306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Takenori Yoshimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, K. Tokuda
{"title":"WaveNet-Based Zero-Delay Lossless Speech Coding","authors":"Takenori Yoshimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, K. Tokuda","doi":"10.1109/SLT.2018.8639598","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639598","url":null,"abstract":"This paper presents a WaveNet-based zero-delay lossless speech coding technique for high-quality communications. The WaveNet generative model, which is a state-of-the-art model for neural-network-based speech waveform synthesis, is used in both the encoder and decoder. In the encoder, discrete speech signals are losslessly compressed using sample-by-sample entropy coding. The decoder fully reconstructs the original speech signals from the compressed signals without algorithmic delay. Experimental results show that the proposed coding technique can transmit speech audio waveforms at 50% of their original bit rate, and that the WaveNet-based speech coder remains effective for unknown speakers.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"497 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127587658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Querying Depression Vlogs","authors":"M. J. Correia, B. Raj, I. Trancoso","doi":"10.1109/SLT.2018.8639555","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639555","url":null,"abstract":"Speech-based diagnosis-aid tools for depression typically depend on a few small datasets that are expensive to collect. The limited availability of training data limits the quality that these systems can achieve. An unexplored alternative as a large-scale source of data is vlogs collected from online multimedia repositories. Along with the automation of the mining process, it is necessary to automate the labeling process too. In this work, we propose a framework to automatically label a corpus of in-the-wild vlogs of possibly depressed subjects, and we estimate the quality of the predicted labels, without ever having access to a ground truth for the majority of the corpus. The framework uses a small labeled subset to train a model and estimate the labels for the remainder of the corpus. Then, using the predicted labels, we train a noisy model and attempt to reconstruct the labels of the original labeled subset. We hypothesize that the quality of the estimated labels for the unlabeled subset of the corpus is correlated with the quality of the label reconstruction of the labeled subset. The results of the bi-modal experiment using in-the-wild data are compared to those obtained using controlled data.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129984670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Suguru Kabashima, Y. Inoue, D. Saito, N. Minematsu
{"title":"DNN-Based Scoring of Language Learners’ Proficiency Using Learners’ Shadowings and Native Listeners’ Responsive Shadowings","authors":"Suguru Kabashima, Y. Inoue, D. Saito, N. Minematsu","doi":"10.1109/SLT.2018.8639645","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639645","url":null,"abstract":"This paper investigates DNN-based scoring techniques applied to two tasks related to foreign language education. One is a conventional task, which attempts to predict a language learner’s overall proficiency in oral communication. For this purpose, learners’ shadowing utterances are assessed automatically. The other is a novel task, which attempts to predict the intelligibility or comprehensibility of a learner’s pronunciation. In this task, native listeners’ responsive shadowings are assessed. For both tasks, similar technical frameworks are tested, where DNN-based phoneme posteriors, posteriorgram-based DTW scores, ASR-based accuracies, shadowing latencies, etc., are used to train regression models, which aim to predict manually rated scores. Experiments show that, in both tasks, the correlation between the DNN-based predicted scores and the averaged human scores is higher than or at least comparable to the averaged correlation between the scores of human raters. This fact clearly indicates that our proposed automatic rating module can be introduced into language education as an additional rater.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127620870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR","authors":"H. Inaguma, M. Mimura, S. Sakai, Tatsuya Kawahara","doi":"10.1109/SLT.2018.8639563","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639563","url":null,"abstract":"Acoustic-to-word (A2W) end-to-end automatic speech recognition (ASR) systems have attracted attention because of their extremely simplified architecture and fast decoding. To alleviate data sparseness issues due to infrequent words, the combination with an acoustic-to-character (A2C) model is investigated. Moreover, the A2C model can be used to recover out-of-vocabulary (OOV) words that are not covered by the A2W model, but this requires accurate detection of OOV words. A2W models learn contexts from both acoustics and transcripts; therefore, they tend to falsely recognize OOV words as words in the vocabulary. In this paper, we tackle this problem by using external language models (LMs), which are trained only on transcriptions and have better linguistic information for detecting OOV words. The A2C model is then used to resolve these OOV words. Experimental evaluations show that external LMs not only reduce errors but also increase the number of detected OOV words, and the proposed method significantly improves performance on English conversational and Japanese lecture corpora, especially in the out-of-domain scenario. We also investigate the impact of the vocabulary size of A2W models and the data size for training LMs. Moreover, our approach can reduce the vocabulary size severalfold with only marginal performance degradation.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126296033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Band Processing With Gabor Filters and Time Delay Neural Nets for Noise Robust Speech Recognition","authors":"György Kovács, L. Tóth, G. Gosztolya","doi":"10.1109/SLT.2018.8639631","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639631","url":null,"abstract":"Spectro-temporal feature extraction and multi-band processing were both invented with the goal of increasing the robustness of speech recognisers. However, although these methods have been in use for a long time and are evidently compatible, few attempts have been made to combine them. Here, we therefore investigate the combination of multi-band processing with the use of spectro-temporal Gabor filters. First, based on the TIMIT corpus, we optimise meta-parameters such as the overlap and the number of bands. Then we verify the cross-corpus viability of our multi-band processing approach on the Aurora-4 corpus. Lastly, we combine our method with the recently proposed channel dropout method. Our results show that this combination not only leads to lower error rates than those obtained using either multi-band processing or channel dropout alone, but also compares favourably to results recently reported for the clean training scenario on the Aurora-4 corpus.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126041429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extension of Conventional Co-Training Learning Strategies to Three-View and Committee-Based Learning Strategies for Effective Automatic Sentence Segmentation","authors":"Dogan Dalva, Ümit Güz, Hakan Gürkan","doi":"10.1109/SLT.2018.8639533","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639533","url":null,"abstract":"The objective of this work is to develop effective multi-view semi-supervised machine learning strategies for the sentence boundary classification problem when only small sets of sentence-boundary-labeled data are available. We propose three-view and committee-based learning strategies incorporating co-training algorithms with agreement, disagreement, and self-combined learning strategies, using prosodic, lexical, and morphological information. We compare the experimental results of the proposed three-view and committee-based learning strategies to other semi-supervised learning strategies in the literature, namely self-training and co-training with agreement, disagreement, and self-combined strategies. The experimental results show that sentence segmentation performance can be greatly improved using the proposed multi-view learning strategies, since the data sets can be represented by three redundantly sufficient and disjoint feature sets. We show that the proposed strategies substantially improve the average performance when only a small set of manually labeled data is available, for both spoken Turkish and English.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128147901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yao Qian, Rutuja Ubale, Matthew David Mulholland, Keelan Evanini, Xinhao Wang
{"title":"A Prompt-Aware Neural Network Approach to Content-Based Scoring of Non-Native Spontaneous Speech","authors":"Yao Qian, Rutuja Ubale, Matthew David Mulholland, Keelan Evanini, Xinhao Wang","doi":"10.1109/SLT.2018.8639697","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639697","url":null,"abstract":"We present a neural network approach to the automated assessment of non-native spontaneous speech in a listen-and-speak task. An attention-based Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) is used to learn the relations (scoring rubrics) between the spoken responses and their assigned scores. Each prompt (listening material) is encoded as a vector in a low-dimensional space and then employed as a condition on the inputs of the attention LSTM-RNN. The experimental results show that our approach performs as well as the strong baseline of a Support Vector Regressor (SVR) using content-related features, i.e., a correlation of r = 0.806 with holistic proficiency scores provided by humans, without any feature engineering. The prompt-encoded vector improves the discrimination between high-scoring and low-scoring samples, and it is more effective in grading responses to unseen prompts, which have no corresponding responses in the training set.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128443452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}