{"title":"The Tencent speech synthesis system for Blizzard Challenge 2020","authors":"Qiao Tian, Zewang Zhang, Linghui Chen, Heng Lu, Chengzhu Yu, Chao Weng, Dong Yu","doi":"10.21437/vcc_bc.2020-4","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-4","url":null,"abstract":"This paper presents the Tencent speech synthesis system for Blizzard Challenge 2020. The corpus released to the partici-pants this year included a TV’s news broadcasting corpus with a length around 8 hours by a Chinese male host (2020-MH1 task), and a Shanghaiese speech corpus with a length around 6 hours (2020-SS1 task). We built a DurIAN-based speech synthesis system for 2020-MH1 task and Tacotron-based system for 2020-SS1 task. For 2020-MH1 task, firstly, a multi-speaker DurIAN-based acoustic model was trained based on linguistic feature to predict mel spectrograms. Then the model was fine-tuned on only the corpus provided. For 2020-SS1 task, instead of training based on hard-aligned phone boundaries, a Tacotron-like end-to-end system is applied to learn the mappings between phonemes and mel spectrograms. Finally, a modified version of WaveRNN model conditioning on the predicted mel spectrograms is trained to generate speech waveform. Our team is identified as L and the evaluation results shows our systems perform very well in various tests. 
Especially, we took the first place in the overall speech intelligibility test.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126280216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Ajmide Text-To-Speech System for Blizzard Challenge 2020","authors":"Beibei Hu, Zilong Bai, Qiang Li","doi":"10.21437/vcc_bc.2020-13","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-13","url":null,"abstract":"This paper presents the Ajmide team’s text-to-speech system for the task MH1 of Blizzard Challenge 2020. The task is to build a voice from about 9.5 hours of speech from a male native speaker of Mandarin. We built a speech synthesis system in an end-to-end style. The system consists of a BERT-based text front end that process both Chinese and English texts, a multi-speaker Tacotron2 model that converts the phoneme and linguistic feature sequence into mel spectrogram, and a modified WaveRNN vocoder that generate the audio waveform from the mel spectrogram. The listening evaluation results show that our system, identified by P, performs well in terms of naturalness, intelligibility and the aspects of intonation, emotion and listening effort.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114927064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment","authors":"Li-Juan Liu, Yan-Nian Chen, Jing-Xuan Zhang, Yuan Jiang, Ya-Jun Hu, Zhenhua Ling, Lirong Dai","doi":"10.21437/vcc_bc.2020-17","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-17","url":null,"abstract":"Although N10 system in Voice Conversion Challenge 2018 (VCC 18) has achieved excellent voice conversion results in both speech naturalness and speaker similarity, the sys-tem’s performance is limited due to some modeling insuffi-ciency. In this paper, we propose to overcome these limita-tions by introducing three modifications. First, we substitute an autoregressive-based model in order to improve the conversion model capability; second, we use high-fidelity WaveNet to model 24kHz/16bit waveform in order to improve conversion speech naturalness; third, a duration adjustment strategy is proposed to compensate the obvious speech rate difference between source and target speakers. Experimental results show that our proposed method can improve the conversion performance significantly. Furthermore, we validate the performance of this system for cross-lingual voice conversion by applying it directly to the cross-lingual task in Voice Conversion Challenge 2020 (VCC 2020). 
The released official subjective results show that our system obtains the best performance in conversion speech naturalness and comparable performance to the best system in speaker similarity, which indicate that our proposed method can achieve state-of-the-art cross-lingual voice conversion performance as well.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127870314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The RoyalFlush Synthesis System for Blizzard Challenge 2020","authors":"Jian Lu, Zeru Lu, Ting-ting He, Peng Zhang, Xinhui Hu, Xinkang Xu","doi":"10.21437/vcc_bc.2020-9","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-9","url":null,"abstract":"The paper presents the RoyalFlush synthesis system for Blizzard Challenge 2020. Two required voices are built from the released Mandarin and Shanghainese data. Based on end-to-end speech synthesis technology, some improvements are introduced to the system compared with our system of last year. Firstly, a Mandarin front-end transforming input text into phoneme sequence along with prosody labels is employed. Then, to improve speech stability, a modified Tacotron acoustic model is proposed. Moreover, we apply GMM-based attention mechanism for robust long-form speech synthesis. Finally, a lightweight LPCNet-based neural vocoder is adopted to achieve a nice traceoff between effectiveness and efficiency. Among all the participating teams of the Challenge, the i-dentifier for our system is N. Evaluation results demonstrates that our system performs relatively well in intelligibility. But it still needs to be improved in terms of naturalness and similarity.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125440685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The OPPO System for the Blizzard Challenge 2020","authors":"Yang Song, Min-Siong Liang, Guilin Yang, Kun Xie, Jie Hao","doi":"10.21437/vcc_bc.2020-3","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-3","url":null,"abstract":"This paper presents the OPPO text-to-speech system for Blizzard Challenge 2020. A statistical parametric speech synthesis based system was built with improvements in both frontend and backend. For the Mandarin task, a BERT model was used for the frontend, a Tacotron acoustic model and a WaveRNN vocoder model were used for the backend. For the Shanghainese task, the frontend was built from scratch, a Tacotron acoustic model and a MelGAN vocoder model were used for the backend. For the Mandarin task, evaluation results showed that our proposed system performed best in naturalness, and achieved near-best results in similarity. For the Shanghainese task, we got poor results in most indicators.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"247 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122580324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Blizzard Challenge 2020","authors":"Xiao Zhou, Zhenhao Ling, Simon King","doi":"10.21437/vcc_bc.2020-1","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-1","url":null,"abstract":"The Blizzard Challenge 2020 is the sixteenth annual Blizzard Challenge. The challenge this year includes a hub task of synthesizing Mandarin speech and a spoke task of synthesizing Shanghainese speech. The speech data of these two Chinese dialects as well as corresponding text transcriptions were provided. Sixteen and eight teams participated in the two tasks respectively. Listening tests were conducted online to evaluate the performance of synthetic speech.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132777834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CASIA Voice Conversion System for the Voice Conversion Challenge 2020","authors":"Lian Zheng, J. Tao, Zhengqi Wen, Rongxiu Zhong","doi":"10.21437/vcc_bc.2020-19","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-19","url":null,"abstract":"This paper presents our CASIA (Chinese Academy of Sciences, Institute of Automation) voice conversion system for the Voice Conversation Challenge 2020 (VCC 2020). The CASIA voice conversion system can be separated into two modules: the conversion model and the vocoder. We first extract linguistic features from the source speech. Then, the conversion model takes these linguistic features as the inputs, aiming to predict the acoustic features of the target speaker. Finally, the vocoder utilizes these predicted features to generate the speech waveform of the target speaker. In our system, we utilize the CBHG conversion model and the LPCNet vocoder for speech generation. To better control the prosody of the converted speech, we utilize acoustic features of the source speech as additional inputs, including the pitch, voiced/unvoiced flag and band aperiodicity. Since the training data is limited in VCC 2020, we build our system by combining the initialization using a multi-speaker data and the adaptation using limited data of the target speaker. 
The results of VCC 2020 rank our CASIA system in the second place with an overall mean opinion score of 3.99 for speaker quality and 84% accuracy for speaker similarity.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130348784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The HITSZ TTS system for Blizzard challenge 2020","authors":"Huhao Fu, Yiben Zhang, Kai Liu, Chao Liu","doi":"10.21437/vcc_bc.2020-11","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-11","url":null,"abstract":"In this paper, we present the techniques that were used in HITSZ-TTS 1 entry in Blizzard Challenge 2020. The corpus released to the participants this year is about 10-hours speech recordings from a Chinese male speaker with mixed Mandarin and English speech. Based on the above situation, we build an end to end speech synthesis system for this task. It is divided into the following parts: (1) the front-end module to analyze the pronunciation and prosody of text; (2) The phoneme-converted tool; (3) The forward-attention based sequence-to-sequence acoustic model with jointly learning with prosody labels to predict 80-dimensional Mel-spectrogram; (4) The Parallel WaveGAN based neural vocoder to reconstruct waveforms. This is the first time for us to join the Blizzard Challenge, and the identifier for our system is G. The evaluation results of subjective listening tests show that the proposed system achieves unsatisfactory performance. The problems in the system are also discussed in this paper.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115608591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The NUS and NWPU system for Voice Conversion Challenge 2020","authors":"Xiaohai Tian, Zhichao Wang, Shan Yang, Xinyong Zhou, Hongqiang Du, Yi Zhou, Mingyang Zhang, Kun Zhou, Berrak Sisman, Lei Xie, Haizhou Li","doi":"10.21437/vcc_bc.2020-26","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-26","url":null,"abstract":"","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132572963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The SHNU System for Blizzard Challenge 2020","authors":"L. He, Q. Shi, Lang Wu, Jianqing Sun, Renke He, Yanhua Long, Jiaen Liang","doi":"10.21437/vcc_bc.2020-2","DOIUrl":"https://doi.org/10.21437/vcc_bc.2020-2","url":null,"abstract":"This paper introduces the SHNU (team I) speech synthesis system for Blizzard Challenge 2020. Speech data released this year includes two parts: a 9.5-hour Mandarin corpus from a male native speaker and a 3-hour Shanghainese corpus from a female native speaker. Based on these corpora, we built two neural network-based speech synthesis systems to synthesize speech for both tasks. The same system architecture was used for both the Mandarin and Shanghainese tasks. Specifically, our systems include a front-end module, a Tacotron-based spectrogram prediction network and a WaveNet-based neural vocoder. Firstly, a pre-built front-end module was used to generate character sequence and linguistic features from the training text. Then, we applied a Tacotron-based sequence-to-sequence model to generate mel-spectrogram from character sequence. Finally, a WaveNet-based neural vocoder was adopted to reconstruct audio waveform with the mel-spectrogram from Tacotron. 
Evaluation results demonstrated that our system achieved an extremely good performance on both tasks, which proved the effectiveness of our proposed system.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"234 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116418857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}