The iFLYTEK system for Blizzard Machine Learning Challenge 2017-ES1
Li-Juan Liu, Chuang Ding, Ya-Jun Hu, Zhenhua Ling, Yuan Jiang, M. Zhou, Si Wei
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: 10.1109/ASRU.2017.8268999
Citations: 3
Abstract
This paper introduces the speech synthesis system submitted by iFLYTEK for the Blizzard Machine Learning Challenge 2017-ES1. Linguistic and acoustic features extracted from a 4-hour corpus were released for this task, and participants were expected to build a speech synthesis system on the given linguistic and acoustic features without using any external data. Our system is composed of a long short-term memory (LSTM) recurrent neural network (RNN)-based acoustic model and a generative adversarial network (GAN)-based post-filter for mel-cepstra. Two approaches to building the GAN-based post-filter were implemented and compared in our experiments. The first is to predict the residuals of natural mel-cepstra given the mel-cepstra predicted by the LSTM-based acoustic model. However, this method led to unstable synthetic speech in our experiments, which may be due to the poor quality of analysis-synthesis speech obtained from the natural acoustic features provided with this corpus. The second approach is to discard the detailed components of natural mel-cepstra through dimensionality reduction with principal component analysis (PCA) and then recover them with a GAN conditioned on the main PCA components. At synthesis time, mel-cepstra predicted by the RNN acoustic model are first projected onto the main PCA components, which are then fed to the GAN for detail recovery. This second approach was adopted in the final submitted system, and the evaluation results show its effectiveness.
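To make the adopted second approach concrete, the following is a minimal sketch (not the authors' code) of the synthesis-time pipeline: mel-cepstra predicted by the acoustic model are projected onto the leading PCA components fitted on natural mel-cepstra, and a trained GAN generator recovers the detailed components. The array shapes, the number of retained components, and the `gan_generator` stand-in are illustrative assumptions; in the paper the generator is a network trained adversarially, which is stubbed here so the sketch runs end to end.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on natural mel-cepstra from the training corpus.
# Shapes and the number of retained "main" components are assumptions.
natural_mcep = np.random.randn(10000, 40)   # frames x mel-cepstral order (placeholder data)
n_main = 10                                  # number of main PCA components kept
pca = PCA(n_components=n_main).fit(natural_mcep)

def gan_generator(main_components: np.ndarray) -> np.ndarray:
    """Placeholder for the trained GAN generator that maps main PCA
    components back to full, detailed mel-cepstra. Here it is stubbed
    with the PCA inverse transform purely so the example executes."""
    return pca.inverse_transform(main_components)

# Synthesis time: mel-cepstra predicted by the LSTM-RNN acoustic model
# are first projected onto the main PCA components ...
predicted_mcep = np.random.randn(500, 40)    # placeholder acoustic-model output
main = pca.transform(predicted_mcep)

# ... which are then fed to the GAN to recover the detailed components.
refined_mcep = gan_generator(main)
print(refined_mcep.shape)                    # (500, 40)
```

By discarding the detail dimensions before GAN training, this scheme avoids asking the GAN to correct residual errors directly, which is what made the first (residual-prediction) approach unstable on this corpus.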