{"title":"The Duke Entry for 2020 Blizzard Challenge","authors":"Zexin Cai, Ming Li","doi":"10.21437/vcc_bc.2020-5","DOIUrl":null,"url":null,"abstract":"This paper presents the speech synthesis system built for the 2020 Blizzard Challenge by team ‘H’. The goal of the challenge is to build a synthesizer that generates high-fidelity speech in a voice similar to that of the provided data. Our system mainly draws on end-to-end neural networks. Specifically, we use an encoder-decoder based prosody prediction network to insert prosodic annotations for a given sentence. We use the spectrogram predictor from Tacotron2 as the end-to-end phoneme-to-spectrogram generator, followed by the neural vocoder WaveRNN to convert predicted spectrograms into audio signals. Additionally, we employ fine-tuning strategies to improve TTS performance. Subjective evaluation of the synthesized audio is conducted in terms of naturalness, similarity, and intelligibility. Samples are available online for listening.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"575 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/vcc_bc.2020-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}