The iFLYTEK system for Blizzard Machine Learning Challenge 2017-ES1
Li-Juan Liu, Chuang Ding, Ya-Jun Hu, Zhenhua Ling, Yuan Jiang, M. Zhou, Si Wei
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: 10.1109/ASRU.2017.8268999
Citations: 3
Abstract
This paper introduces the speech synthesis system submitted by iFLYTEK for the Blizzard Machine Learning Challenge 2017-ES1. Linguistic and acoustic features extracted from a 4-hour corpus were released for this task, and participants were expected to build a speech synthesis system on the given linguistic and acoustic features without using any external data. Our system is composed of a long short-term memory (LSTM) recurrent neural network (RNN)-based acoustic model and a generative adversarial network (GAN)-based post-filter for mel-cepstra. Two approaches to building the GAN-based post-filter were implemented and compared in our experiments. The first is to predict the residuals of natural mel-cepstra given the mel-cepstra predicted by the LSTM-based acoustic model. However, this method led to unstable synthetic speech in our experiments, which may be due to the poor quality of analysis-synthesis speech obtained from the natural acoustic features provided with this corpus. The second approach is to discard the detailed components of natural mel-cepstra through dimensionality reduction with principal component analysis (PCA) and then recover them with a GAN conditioned on the main PCA components. At synthesis time, mel-cepstra predicted by the RNN acoustic model are first projected onto the main PCA components, which are then fed to the GAN for detail recovery. This second approach was adopted in the final submitted system, and the evaluation results show its effectiveness.
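To make the adopted second approach concrete, the following is a minimal sketch (not the authors' code) of the synthesis-time pipeline: mel-cepstra predicted by the acoustic model are projected onto the leading PCA components fitted on natural mel-cepstra, and a trained GAN generator recovers the detailed components. The array shapes, the number of retained components, and the `gan_generator` stand-in are illustrative assumptions; in the paper the generator is a network trained adversarially, which is stubbed here so the sketch runs end to end.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on natural mel-cepstra from the training corpus.
# Shapes and the number of retained "main" components are assumptions.
natural_mcep = np.random.randn(10000, 40)   # frames x mel-cepstral order (placeholder data)
n_main = 10                                  # number of main PCA components kept
pca = PCA(n_components=n_main).fit(natural_mcep)

def gan_generator(main_components: np.ndarray) -> np.ndarray:
    """Placeholder for the trained GAN generator that maps main PCA
    components back to full, detailed mel-cepstra. Here it is stubbed
    with the PCA inverse transform purely so the example executes."""
    return pca.inverse_transform(main_components)

# Synthesis time: mel-cepstra predicted by the LSTM-RNN acoustic model
# are first projected onto the main PCA components ...
predicted_mcep = np.random.randn(500, 40)    # placeholder acoustic-model output
main = pca.transform(predicted_mcep)

# ... which are then fed to the GAN to recover the detailed components.
refined_mcep = gan_generator(main)
print(refined_mcep.shape)                    # (500, 40)
```

By discarding the detail dimensions before GAN training, this scheme avoids asking the GAN to correct residual errors directly, which is what made the first (residual-prediction) approach unstable on this corpus.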