{"title":"Data Selection for Improving Naturalness of TTS Voices Trained on Small Found Corpuses","authors":"Fang-Yu Kuo, S. Aryal, G. Degottex, S. Kang, P. Lanchantin, I. Ouyang","doi":"10.1109/SLT.2018.8639642","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639642","url":null,"abstract":"This work investigates techniques that select training data from small, found corpuses in order to improve the naturalness of synthesized text-to-speech voices. The approach outlined in this paper examines different metrics to detect and reject segments of training data that can degrade the performance of the system. We conducted experiments on two small datasets extracted from Mandarin Chinese audiobooks that have different characteristics in terms of recording conditions, narrator, and transcriptions. We show that using a even smaller, yet carefully selected, set of data can lead to a text-to-speech system able to generate more natural speech than a system trained on the complete dataset. Three metrics related to the narrator’s articulation proposed in the paper give significant improvements in naturalness.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133487444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating Semantic Similarity Atlas for Natural Languages","authors":"Lutfi Kerem Senel, Ihsan Utlu, Veysel Yücesoy, Aykut Koç, T. Çukur","doi":"10.1109/slt.2018.8639521","DOIUrl":"https://doi.org/10.1109/slt.2018.8639521","url":null,"abstract":"Cross-lingual studies attract a growing interest in natural language processing (NLP) research, and several studies showed that similar languages are more advantageous to work with than fundamentally different languages in transferring knowledge. Different similarity measures for the languages are proposed by researchers from different domains. However, a similarity measure focusing on semantic structures of languages can be useful for selecting pairs or groups of languages to work with, especially for the tasks requiring semantic knowledge such as sentiment analysis or word sense disambiguation. For this purpose, in this work, we leverage a recently proposed word embedding based method to generate a language similarity atlas for 76 different languages around the world. This atlas can help researchers select similar language pairs or groups in cross-lingual applications. Our findings suggest that semantic similarity between two languages is strongly correlated with the geographic proximity of the countries in which they are used.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133674787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Extension of ASR Lexicon Using Wikipedia Data","authors":"Badr M. Abdullah, I. Illina, D. Fohr","doi":"10.1109/SLT.2018.8639592","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639592","url":null,"abstract":"Despite recent progress in developing Large Vocabulary Continuous Speech Recognition Systems (LVCSR), these systems suffer from-Of-Vocabulary words (OOV). In many cases, the OOV words are Proper Nouns (PNs). The correct recognition of PNs is essential for broadcast news, audio indexing, etc. In this article, we address the problem of OOV PN retrieval in the framework of broadcast news LVCSR. We focused on dynamic (document dependent) extension of LVCSR lexicon. To retrieve relevant OOV PNs, we propose to use a very large multipurpose text corpus: Wikipedia. This corpus contains a huge number of PNs. These PNs are grouped in semantically similar classes using word embedding. We use a two-step approach: first, we select OOV PN pertinent classes with a multi-class Deep Neural Network (DNN). Secondly, we rank the OOVs of the selected classes. The experiments on French broadcast news show that the Bi-GRU model outperforms other studied models. Speech recognition experiments demonstrate the effectiveness of the proposed methodology.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114448738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Convolutional Neural Networks for Dialogue State Tracking without Pre-Trained Word Vectors or Semantic Dictionaries","authors":"M. Korpusik, James R. Glass","doi":"10.1109/SLT.2018.8639559","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639559","url":null,"abstract":"A crucial step in task-oriented dialogue systems is tracking the user’s goal over the course of the conversation. This involves maintaining a probability distribution over possible values for each slot (e.g., the foodslot might map to the value Turkish), which gets updated at each turn of the dialogue. Previously, rule-based methods were applied to dialogue systems, or models that required hand-crafted semantic dictionaries mapping phrases to those that are similar in meaning (e.g., areamight map to part of town). However, these are expensive to design for each domain, limiting the generalizability. In addition, often a spoken language understanding (SLU) component precedes the dialogue state update mechanism; however, this leads to compounded errors as the output from one module is passed to the next. Instead, more recent work has explored deep learning models for directly updating dialogue state, bypassing the need for SLU or expert-engineered rules. We demonstrate that a novel convolutional neural architecture without any pre-trained word vectors or semantic dictionaries achieves 86.9% joint goal accuracy and 95.4% requested slot accuracy on WOZ 2.0.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132833581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Occam’s Adaptation: A Comparison of Interpolation of Bases Adaptation Methods for Multi-Dialect Acoustic Modeling with LSTMS","authors":"M. Grace, M. Bastani, Eugene Weinstein","doi":"10.1109/SLT.2018.8639654","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639654","url":null,"abstract":"Multidialectal languages can pose challenges for acoustic modeling. Past research has shown that with a large training corpus but without explicit modeling of inter-dialect variability, training individual per-dialect models yields superior performance to that of a single model trained on the combined data [1, 2]. In this work, we were motivated by the idea that adaptation techniques can allow the models to learn dialect-independent features and in turn leverage the power of the larger training corpus sizes afforded when pooling data across dialects. Our goal was thus to create a single multidialect acoustic model that would rival the performance of the dialect-specific models.Working in the context of deep Long-Short Term Memory (LSTM) acoustic models trained on up to 40K hours of speech, we explored several methods for training and incorporating dialect-specific information into the model, including 12 variants of interpolation-of-bases techniques related to Cluster Adaptive Training (CAT) [3] and Factorized Hidden Layer (FHL) [4] techniques. We found that with our model topology and large training corpus, simply appending the dialect-specific information to the feature vector resulted in a more accurate model than any of the more complex interpolation-of-bases techniques, while requiring less model complexity and fewer parameters. This simple adaptation yielded a single unified model for all dialects that, in most cases, outperformed individual models which had been trained per-dialect.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123102181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coupled Representation Learning for Domains, Intents and Slots in Spoken Language Understanding","authors":"Jihwan Lee, Dongchan Kim, R. Sarikaya, Young-Bum Kim","doi":"10.1109/SLT.2018.8639581","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639581","url":null,"abstract":"Representation learning is an essential problem in a wide range of applications and it is important for performing downstream tasks successfully. In this paper, we propose a new model that learns coupled representations of domains, intents, and slots by taking advantage of their hierarchical dependency in a Spoken Language Understanding system. Our proposed model learns the vector representation of intents based on the slots tied to these intents by aggregating the representations of the slots. Similarly, the vector representation of a domain is learned by aggregating the representations of the intents tied to a specific domain. To the best of our knowledge, it is the first approach to jointly learning the representations of domains, intents, and slots using their hierarchical relationships. The experimental results demonstrate the effectiveness of the representations learned by our model, as evidenced by improved performance on the contextual cross-domain reranking task.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122258003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Objective Multi-Task Learning on RNNLM for Speech Recognition","authors":"Minguang Song, Yunxin Zhao, Shaojun Wang","doi":"10.1109/SLT.2018.8639649","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639649","url":null,"abstract":"The cross entropy (CE) loss function is commonly adopted for neural network language model (NNLM) training. Although this criterion is largely successful, as evidenced by the quick advance of NNLM, minimizing CE only maximizes likelihood of training data. When training data is insufficient, the generalization power of the resulting LM is limited on test data. In this paper, we propose to integrate a pairwise ranking (PR) loss with the CE loss for multi-objective training on recurrent neural network language model (RNNLM). The PR loss emphasizes discrimination between target and non-target words and also reserves probabilities for low-frequency correct words, which complements the distribution learning role of the CE loss. Combining the two losses may therefore help improve the performance of RNNLM. In addition, we incorporate multi-task learning (MTL) into the proposed multi-objective learning to regularize the primary task of RNNLM by an auxiliary task of part-of-speech (POS) tagging. The proposed approach to RNNLM learning has been evaluated on two speech recognition tasks of WSJ and AMI with encouraging results achieved on word error rate reductions.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121044007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parameter Generation Algorithms for Text-To-Speech Synthesis with Recurrent Neural Networks","authors":"V. Klimkov, A. Moinet, Adam Nadolski, Thomas Drugman","doi":"10.1109/SLT.2018.8639626","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639626","url":null,"abstract":"Recurrent Neural Networks (RNN) have recently proved to be effective in acoustic modeling for TTS. Various techniques such as the Maximum Likelihood Parameter Generation (MLPG) algorithm have been naturally inherited from the HMM-based speech synthesis framework. This paper investigates in which situations parameter generation and variance restoration approaches help for RNN-based TTS. We explore how their performance is affected by various factors such as the choice of the loss function, the application of regularization methods and the amount of training data. We propose an efficient way to calculate MLPG using a convolutional kernel. Our results show that the use of the L1 loss with proper regularization outperforms any system built with the conventional L2 loss and does not require to apply MLPG (which is necessary otherwise). We did not observe perceptual improvements when embedding MLPG into the acoustic model. Finally, we show that variance restoration approaches are important for cepstral features but only yield minor perceptual gains for the prediction of F0.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116059539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Short Utterance Speaker Recognition by Reservoir with Self-Organized Mapping","authors":"Narumitsu Ikeda, Yoshinao Sato, Hirokazu Takahashi","doi":"10.1109/SLT.2018.8639570","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639570","url":null,"abstract":"Short utterances cause performance degradation in conventional speaker recognition systems based on i-vector, which relies on the statistics of spectral features. To overcome this difficulty, we propose a novel method that utilizes the dynamics of the spectral features as well as their distribution. Our model integrates echo state network (ESN), a type of reservoir computing architecture, and self-organizing map (SOM), a competitive learning network. The ESN consists of a single-hidden-layer recurrent neural network with randomly fixed weights, which extracts temporal patterns of the spectral features. The input weights of our model are trained using the unsupervised competitive learning algorithm of the SOM, before enrollment, to extract the intrinsic structure of the spectral features, whereas the input weights are fixed randomly in the original ESN. In enrollment, the output weights are trained in a supervised manner to recognize an individual in a group of speakers. Our experiment demonstrates that the proposed method outperforms or is comparable to a baseline i-vector system for text-independent speaker identification on short utterances.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128098956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Examplar-Based Speechwaveform Generation for Text-To-Speech","authors":"Cassia Valentini-Botinhao, O. Watts, Felipe Espic, Simon King","doi":"10.1109/SLT.2018.8639679","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639679","url":null,"abstract":"This paper presents a hybrid text-to-speech framework that uses a waveform generation method based on examplars of natural speech waveform. These examplars are selected at synthesis time given a sequence of acoustic features generated from text by a statistical parametric speech synthesis model. In order to match the expected degradation of these target synthesis features, the database of units is constructed such that the units’ target representations are generated from the same parametric model. We evaluate two variants of this framework by modifying the size of the examplar: a small unit variant (where unit boundaries are determined by pitch mark location) and a halfphone variant (where unit boundaries are determined by subphone state forced alignment). We found that for a larger dataset (around four hours of training data) the examplar-based waveform generation variants are rated higher than the vocoder-based system.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132337625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}