{"title":"StyleFormerGAN-VC:Improving Effect of few shot Cross-Lingual Voice Conversion Using VAE-StarGAN and Attention-AdaIN","authors":"Dengfeng Ke, Wenhan Yao, Ruixin Hu, Liangjie Huang, Qi Luo, Wentao Shu","doi":"10.1109/SNPD54884.2022.10051811","DOIUrl":null,"url":null,"abstract":"Voice Conversion (VC) aims to transfer the speaker timbre while retaining the lexical content of the source speech and has attracted much attention lately. Although previous VC models have achieved good performance, unstability can not be avoided when it comes cross-lingual scenario. In this paper, we propose the StyleFormerGAN-VC to achieve better cross language speech conversion, where variational auto-encoder is introduced to model the feature distribution of the cross-lingual utterances and adversarial training is applied to elevate the speech quality. In addition, we combine the Attention mechanism and AdaIN to make our model more generalized to unseen speaker with long utterance. Experiments show that our model performs stably in the cross-lingual scenario and gains well MOS evaluation scores.","PeriodicalId":425462,"journal":{"name":"2022 IEEE/ACIS 23rd International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACIS 23rd International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SNPD54884.2022.10051811","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Voice Conversion (VC) aims to transfer the speaker's timbre while retaining the lexical content of the source speech, and has attracted much attention lately. Although previous VC models have achieved good performance, instability cannot be avoided in the cross-lingual scenario. In this paper, we propose StyleFormerGAN-VC to achieve better cross-lingual speech conversion, where a variational auto-encoder is introduced to model the feature distribution of cross-lingual utterances and adversarial training is applied to improve speech quality. In addition, we combine an attention mechanism with AdaIN so that the model generalizes better to unseen speakers with long utterances. Experiments show that our model performs stably in the cross-lingual scenario and achieves good MOS evaluation scores.
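The abstract describes combining an attention mechanism with AdaIN to condition conversion on a reference speaker. The paper's own architecture is not shown here, so the following is only a minimal illustrative sketch of that general idea: attention pools a variable-length reference utterance into a style vector, from which per-channel scale and bias are predicted and applied to instance-normalized content features (AdaIN). All module and parameter names (`AttentionAdaIN`, `d_model`, `n_heads`) are assumptions for illustration.

```python
import torch
import torch.nn as nn


class AttentionAdaIN(nn.Module):
    """Sketch: attention-pooled speaker style applied via AdaIN (illustrative, not the authors' code)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Learnable query that attends over the reference-speaker frames.
        self.style_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Predict per-channel scale and bias for AdaIN from the pooled style vector.
        self.to_scale_bias = nn.Linear(d_model, 2 * d_model)

    def forward(self, content: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # content:   (batch, T_content, d_model) source-speech features
        # reference: (batch, T_ref, d_model) target-speaker features (any length)
        batch = content.size(0)
        query = self.style_query.expand(batch, -1, -1)
        style, _ = self.attn(query, reference, reference)        # (batch, 1, d_model)
        scale, bias = self.to_scale_bias(style).chunk(2, dim=-1)

        # Instance-normalize the content over time, then re-scale and shift it
        # with the speaker-dependent statistics (AdaIN).
        mean = content.mean(dim=1, keepdim=True)
        std = content.std(dim=1, keepdim=True) + 1e-5
        normalized = (content - mean) / std
        return normalized * (1.0 + scale) + bias


if __name__ == "__main__":
    layer = AttentionAdaIN()
    src = torch.randn(2, 120, 256)   # source content features
    ref = torch.randn(2, 300, 256)   # long reference utterance
    print(layer(src, ref).shape)     # torch.Size([2, 120, 256])
```

Because the reference utterance is summarized by attention rather than truncated or averaged frame-by-frame, the same module handles long or unseen-speaker references, which is the property the abstract attributes to the Attention-AdaIN combination.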