A Non-Autoregressive Network for Chinese Text to Speech and Voice Cloning

2021 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA) | Pub Date: 2021-06-28 | DOI: 10.1109/ICAICA52286.2021.9497934

Chun Zhang, Yueqing Cai, Wenbi Rao

Citations: 1

Abstract

Text-to-speech (TTS) has evolved rapidly in recent years. Researchers have successfully converted English text into speech that sounds like a natural speaker, proposing numerous models ranging from RNNs to non-autoregressive networks. However, migrating these models to Chinese TTS remains difficult because of Chinese prosodic phrasing problems and the large character set, and the models that have been migrated successfully, most of which are autoregressive, give disappointing results. In this paper, we migrate FastSpeech 2 to Chinese TTS and use a generative adversarial network (GAN) discriminator during training to improve the output. The Postnet of Tacotron 2 is also applied to refine the mel-spectrogram. In addition, we use an x-vector-based voiceprint extraction model to extract voiceprints for voice cloning. Experiments on both models yield a mean opinion score (MOS) of 3.83 for naturalness and 3.82 for similarity.
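
The abstract reports its results as mean opinion scores. As a minimal sketch of how such a score is commonly computed, assuming listeners rate each utterance on a 1 to 5 scale (the rating protocol used in the paper is not stated here), the Python below averages the ratings and adds a normal-approximation 95% interval. The function name `mean_opinion_score` and the sample ratings are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = mean(ratings)
    # Normal-approximation interval: 1.96 * standard error of the mean.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings, made up for illustration; NOT data from the paper.
example_ratings = [4, 4, 3, 5, 4, 3, 4, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The same function could be applied separately to naturalness ratings and to speaker-similarity ratings, which is how two distinct MOS values such as those in the abstract would arise.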
