A Non-Autoregressive Network for Chinese Text to Speech and Voice Cloning

Chun Zhang, Yueqing Cai, Wenbi Rao
Published in: 2021 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA)
Publication date: 2021-06-28
DOI: https://doi.org/10.1109/ICAICA52286.2021.9497934
Citations: 1

Abstract

Text to speech (TTS) has evolved rapidly in recent years. Researchers have successfully converted English text into natural-sounding speech, proposing numerous models that range from RNNs to non-autoregressive networks. However, migrating these models to Chinese TTS remains difficult because of prosodic phrasing problems and the large Chinese character set. The models that have been migrated, most of which are autoregressive, also give disappointing results. In this paper, we migrate FastSpeech2 to Chinese TTS and use a generative adversarial network (GAN) discriminator during training to improve the output. The Postnet of Tacotron2 is also applied to refine the mel-spectrogram. In addition, we use an x-vector-based voiceprint extraction model to extract voiceprints and achieve voice cloning. Experiments on both models yield a mean opinion score (MOS) of 3.83 for naturalness and 3.82 for similarity.
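The abstract reports its results as mean opinion scores. As a minimal sketch of how such a score is commonly computed, assuming listeners rate each sample on a 1 to 5 scale, the snippet below averages the ratings and attaches a normal-approximation 95% confidence interval. The function name `mean_opinion_score` and the ratings are hypothetical illustrations and are not taken from the paper, which does not describe its scoring procedure here.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for listener ratings on a 1-5 scale.

    half_width is the half-width of a 95% confidence interval under a
    normal approximation (an assumption; small panels may prefer a
    t-distribution instead).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1..5")
    mos = statistics.mean(ratings)
    # Standard error of the mean, scaled by the 95% normal quantile.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical naturalness ratings from eight listeners (not the paper's data).
naturalness = [4, 4, 3, 5, 4, 3, 4, 4]
mos, ci = mean_opinion_score(naturalness)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The same function could be applied separately to naturalness ratings and to speaker-similarity ratings, which is how two figures such as 3.83 and 3.82 would typically be obtained.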