{"title":"使用VAE-StarGAN和Attention-AdaIN提高少镜头跨语言语音转换效果","authors":"Dengfeng Ke, Wenhan Yao, Ruixin Hu, Liangjie Huang, Qi Luo, Wentao Shu","doi":"10.1109/SNPD54884.2022.10051811","DOIUrl":null,"url":null,"abstract":"Voice Conversion (VC) aims to transfer the speaker timbre while retaining the lexical content of the source speech and has attracted much attention lately. Although previous VC models have achieved good performance, unstability can not be avoided when it comes cross-lingual scenario. In this paper, we propose the StyleFormerGAN-VC to achieve better cross language speech conversion, where variational auto-encoder is introduced to model the feature distribution of the cross-lingual utterances and adversarial training is applied to elevate the speech quality. In addition, we combine the Attention mechanism and AdaIN to make our model more generalized to unseen speaker with long utterance. Experiments show that our model performs stably in the cross-lingual scenario and gains well MOS evaluation scores.","PeriodicalId":425462,"journal":{"name":"2022 IEEE/ACIS 23rd International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"StyleFormerGAN-VC:Improving Effect of few shot Cross-Lingual Voice Conversion Using VAE-StarGAN and Attention-AdaIN\",\"authors\":\"Dengfeng Ke, Wenhan Yao, Ruixin Hu, Liangjie Huang, Qi Luo, Wentao Shu\",\"doi\":\"10.1109/SNPD54884.2022.10051811\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Voice Conversion (VC) aims to transfer the speaker timbre while retaining the lexical content of the source speech and has attracted much attention lately. Although previous VC models have achieved good performance, unstability can not be avoided when it comes cross-lingual scenario. In this paper, we propose the StyleFormerGAN-VC to achieve better cross language speech conversion, where variational auto-encoder is introduced to model the feature distribution of the cross-lingual utterances and adversarial training is applied to elevate the speech quality. In addition, we combine the Attention mechanism and AdaIN to make our model more generalized to unseen speaker with long utterance. 
Experiments show that our model performs stably in the cross-lingual scenario and gains well MOS evaluation scores.\",\"PeriodicalId\":425462,\"journal\":{\"name\":\"2022 IEEE/ACIS 23rd International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)\",\"volume\":\"42 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE/ACIS 23rd International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SNPD54884.2022.10051811\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACIS 23rd International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SNPD54884.2022.10051811","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
StyleFormerGAN-VC:Improving Effect of few shot Cross-Lingual Voice Conversion Using VAE-StarGAN and Attention-AdaIN
Voice Conversion (VC) aims to transfer the speaker's timbre while retaining the lexical content of the source speech, and has attracted much attention lately. Although previous VC models achieve good performance, instability cannot be avoided in the cross-lingual scenario. In this paper, we propose StyleFormerGAN-VC for better cross-lingual voice conversion: a variational auto-encoder is introduced to model the feature distribution of cross-lingual utterances, and adversarial training is applied to improve speech quality. In addition, we combine an attention mechanism with AdaIN so that the model generalizes better to unseen speakers and long utterances. Experiments show that the model performs stably in the cross-lingual scenario and obtains good MOS evaluation scores.
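The abstract does not detail the architecture, so the sketch below is only an illustration of how an attention mechanism can be combined with AdaIN for speaker conditioning: it attention-pools a variable-length reference utterance into a speaker embedding and uses that embedding to predict the scale and shift applied to instance-normalized content features. The module and all names in it (AttentionAdaIN, content_dim, style_dim) are hypothetical and not taken from the authors' code.

```python
# Illustrative sketch, NOT the paper's implementation: attention-pooled
# speaker embedding driving AdaIN-style modulation of content features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAdaIN(nn.Module):
    """Pools a variable-length reference utterance with attention, then
    predicts per-channel scale/shift (AdaIN) for the content features."""

    def __init__(self, content_dim: int, style_dim: int):
        super().__init__()
        # Scalar attention score per reference frame.
        self.score = nn.Linear(style_dim, 1)
        # Map the pooled style vector to AdaIN scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(style_dim, content_dim)
        self.to_beta = nn.Linear(style_dim, content_dim)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (B, T_c, content_dim); style: (B, T_s, style_dim)
        attn = F.softmax(self.score(style), dim=1)      # (B, T_s, 1)
        pooled = (attn * style).sum(dim=1)              # (B, style_dim)

        # Instance-normalize the content features over time.
        mean = content.mean(dim=1, keepdim=True)
        std = content.std(dim=1, keepdim=True) + 1e-5
        normalized = (content - mean) / std

        gamma = self.to_gamma(pooled).unsqueeze(1)      # (B, 1, content_dim)
        beta = self.to_beta(pooled).unsqueeze(1)
        return gamma * normalized + beta


if __name__ == "__main__":
    layer = AttentionAdaIN(content_dim=256, style_dim=192)
    content = torch.randn(2, 120, 256)    # content features of the source speech
    reference = torch.randn(2, 300, 192)  # frame-level features of the target speaker
    print(layer(content, reference).shape)  # torch.Size([2, 120, 256])
```

Because the reference utterance is summarized by attention rather than a fixed-length crop, a conditioning scheme of this kind can, in principle, accept long utterances from unseen speakers, which is the generalization property the abstract attributes to the Attention-AdaIN combination.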