V. Klimkov, A. Moinet, Adam Nadolski, Thomas Drugman
{"title":"基于递归神经网络的文本-语音合成参数生成算法","authors":"V. Klimkov, A. Moinet, Adam Nadolski, Thomas Drugman","doi":"10.1109/SLT.2018.8639626","DOIUrl":null,"url":null,"abstract":"Recurrent Neural Networks (RNN) have recently proved to be effective in acoustic modeling for TTS. Various techniques such as the Maximum Likelihood Parameter Generation (MLPG) algorithm have been naturally inherited from the HMM-based speech synthesis framework. This paper investigates in which situations parameter generation and variance restoration approaches help for RNN-based TTS. We explore how their performance is affected by various factors such as the choice of the loss function, the application of regularization methods and the amount of training data. We propose an efficient way to calculate MLPG using a convolutional kernel. Our results show that the use of the L1 loss with proper regularization outperforms any system built with the conventional L2 loss and does not require to apply MLPG (which is necessary otherwise). We did not observe perceptual improvements when embedding MLPG into the acoustic model. Finally, we show that variance restoration approaches are important for cepstral features but only yield minor perceptual gains for the prediction of F0.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"217 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Parameter Generation Algorithms for Text-To-Speech Synthesis with Recurrent Neural Networks\",\"authors\":\"V. Klimkov, A. Moinet, Adam Nadolski, Thomas Drugman\",\"doi\":\"10.1109/SLT.2018.8639626\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recurrent Neural Networks (RNN) have recently proved to be effective in acoustic modeling for TTS. Various techniques such as the Maximum Likelihood Parameter Generation (MLPG) algorithm have been naturally inherited from the HMM-based speech synthesis framework. This paper investigates in which situations parameter generation and variance restoration approaches help for RNN-based TTS. We explore how their performance is affected by various factors such as the choice of the loss function, the application of regularization methods and the amount of training data. We propose an efficient way to calculate MLPG using a convolutional kernel. Our results show that the use of the L1 loss with proper regularization outperforms any system built with the conventional L2 loss and does not require to apply MLPG (which is necessary otherwise). We did not observe perceptual improvements when embedding MLPG into the acoustic model. Finally, we show that variance restoration approaches are important for cepstral features but only yield minor perceptual gains for the prediction of F0.\",\"PeriodicalId\":377307,\"journal\":{\"name\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"217 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT.2018.8639626\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT.2018.8639626","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Parameter Generation Algorithms for Text-To-Speech Synthesis with Recurrent Neural Networks
Recurrent Neural Networks (RNN) have recently proved to be effective in acoustic modeling for TTS. Various techniques such as the Maximum Likelihood Parameter Generation (MLPG) algorithm have been naturally inherited from the HMM-based speech synthesis framework. This paper investigates in which situations parameter generation and variance restoration approaches help for RNN-based TTS. We explore how their performance is affected by various factors such as the choice of the loss function, the application of regularization methods and the amount of training data. We propose an efficient way to calculate MLPG using a convolutional kernel. Our results show that the use of the L1 loss with proper regularization outperforms any system built with the conventional L2 loss and does not require to apply MLPG (which is necessary otherwise). We did not observe perceptual improvements when embedding MLPG into the acoustic model. Finally, we show that variance restoration approaches are important for cepstral features but only yield minor perceptual gains for the prediction of F0.