Guangming Li, Guanyu Li, Yugang Dai, Zhihao Song, Lin Meng
{"title":"Research on the realization of multilingual speech synthesis and cross-lingual sound cloning in Tibetan","authors":"Guangming Li, Guanyu Li, Yugang Dai, Zhihao Song, Lin Meng","doi":"10.1109/IIP57348.2022.00026","DOIUrl":null,"url":null,"abstract":"Speech synthesis technology has achieved rapid development in recent years, and the speech synthesized has reached a very high level of intelligibility and naturalness. However, once the speech to synthesize is mixed with words from other languages, the quality of the speech will be greatly compromised. Imagine how great it would be if one can hear foreign place names pronounced in the corresponding language very smoothly when navigating. Given that most of us can only speak one or two foreign languages due to time constraints, it would make a big difference to speak a foreign language in your voice. Implementing it using the existing monolingual model has difficulty in collecting sound data from someone who speaks different languages at the same time. Using only monolingual corpora, our model can do a good job of cloning one person’s voice and realizing code-switching. The parameters of the encoder are generated by a separate network based on a specific language vector, the parameter generator module consists of several specific parameter generators, each of which takes a language vector as input to generate the parameters of a layer of an encoder in a given language and to complete the sound cloning, we use an adversarial speaker classifier to eliminate specific speaker information in model training and the information will be going back in the synthesis. 
Our model performs very well on code-switching task and can synthesize high-quality speech with high accuracy.","PeriodicalId":412907,"journal":{"name":"2022 4th International Conference on Intelligent Information Processing (IIP)","volume":"234 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 4th International Conference on Intelligent Information Processing (IIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IIP57348.2022.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Speech synthesis technology has developed rapidly in recent years, and synthesized speech now reaches a very high level of intelligibility and naturalness. However, once the text to be synthesized mixes in words from another language, speech quality degrades sharply. Imagine how convenient it would be to hear foreign place names pronounced smoothly in the corresponding language while navigating. Given that most of us can speak only one or two foreign languages due to time constraints, it would make a big difference to be able to speak a foreign language in one's own voice. Achieving this with existing monolingual models is difficult, because it requires collecting recordings of a single speaker who speaks all of the target languages. Using only monolingual corpora, our model clones a single speaker's voice and realizes code-switching. The parameters of the encoder are generated by a separate network conditioned on a language vector: the parameter generator module consists of several language-specific parameter generators, each of which takes the language vector as input and produces the parameters of one encoder layer for the given language. To complete voice cloning, we use an adversarial speaker classifier to remove speaker-specific information during training; this information is then reintroduced at synthesis time. Our model performs very well on the code-switching task and synthesizes high-quality speech with high accuracy.
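The parameter generator described above is a form of hypernetwork: a small network maps a language vector to the weights of an encoder layer, so each language gets its own layer parameters without training separate encoders. The following is a minimal numpy sketch of that idea; all dimensions, variable names, and the linear-plus-tanh layer are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical dimensions (not taken from the paper).
LANG_DIM, IN_DIM, OUT_DIM = 4, 8, 8

rng = np.random.default_rng(0)

# Parameter generator: a linear map from a language vector to the
# flattened weight matrix of one encoder layer.
G_W = rng.standard_normal((LANG_DIM, IN_DIM * OUT_DIM)) * 0.1
G_b = np.zeros(IN_DIM * OUT_DIM)

def generate_layer_params(lang_vec):
    """Produce one encoder layer's weight matrix from a language vector."""
    flat = lang_vec @ G_W + G_b
    return flat.reshape(IN_DIM, OUT_DIM)

def encoder_layer(x, lang_vec):
    """Apply the language-conditioned layer (linear + tanh, for illustration)."""
    W = generate_layer_params(lang_vec)
    return np.tanh(x @ W)

# One-hot language vectors select different generated weights,
# so the same input is encoded differently per language.
tibetan = np.eye(LANG_DIM)[0]
mandarin = np.eye(LANG_DIM)[1]
x = rng.standard_normal(IN_DIM)
y_tib = encoder_layer(x, tibetan)
y_man = encoder_layer(x, mandarin)
```

In the full model one such generator would exist per encoder layer, and the generator's own weights (here `G_W`, `G_b`) are what training updates, so knowledge is shared across languages through the generator while the emitted layer weights remain language-specific.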