Guangming Li, Guanyu Li, Yugang Dai, Zhihao Song, Lin Meng
{"title":"Research on the realization of multilingual speech synthesis and cross-lingual sound cloning in Tibetan","authors":"Guangming Li, Guanyu Li, Yugang Dai, Zhihao Song, Lin Meng","doi":"10.1109/IIP57348.2022.00026","DOIUrl":null,"url":null,"abstract":"Speech synthesis technology has achieved rapid development in recent years, and the speech synthesized has reached a very high level of intelligibility and naturalness. However, once the speech to synthesize is mixed with words from other languages, the quality of the speech will be greatly compromised. Imagine how great it would be if one can hear foreign place names pronounced in the corresponding language very smoothly when navigating. Given that most of us can only speak one or two foreign languages due to time constraints, it would make a big difference to speak a foreign language in your voice. Implementing it using the existing monolingual model has difficulty in collecting sound data from someone who speaks different languages at the same time. Using only monolingual corpora, our model can do a good job of cloning one person’s voice and realizing code-switching. The parameters of the encoder are generated by a separate network based on a specific language vector, the parameter generator module consists of several specific parameter generators, each of which takes a language vector as input to generate the parameters of a layer of an encoder in a given language and to complete the sound cloning, we use an adversarial speaker classifier to eliminate specific speaker information in model training and the information will be going back in the synthesis. 
Our model performs very well on code-switching task and can synthesize high-quality speech with high accuracy.","PeriodicalId":412907,"journal":{"name":"2022 4th International Conference on Intelligent Information Processing (IIP)","volume":"234 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 4th International Conference on Intelligent Information Processing (IIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IIP57348.2022.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Speech synthesis technology has developed rapidly in recent years, and synthesized speech now reaches a very high level of intelligibility and naturalness. However, once the text to be synthesized mixes in words from another language, speech quality degrades sharply. Imagine how convenient it would be to hear foreign place names pronounced smoothly in the corresponding language while navigating. Given that most of us can speak only one or two foreign languages due to time constraints, it would make a big difference to be able to speak a foreign language in one's own voice. Achieving this with existing monolingual models is difficult, because it requires collecting recordings of a single speaker who speaks all of the target languages. Using only monolingual corpora, our model clones a single speaker's voice and realizes code-switching. The parameters of the encoder are generated by a separate network conditioned on a language vector: the parameter generator module consists of several language-specific parameter generators, each of which takes the language vector as input and produces the parameters of one encoder layer for the given language. To complete voice cloning, we use an adversarial speaker classifier to remove speaker-specific information during training; this information is then reintroduced at synthesis time. Our model performs very well on the code-switching task and synthesizes high-quality speech with high accuracy.
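The parameter generator described above is a form of hypernetwork: a small network maps a language vector to the weights of an encoder layer, so each language gets its own layer parameters without training separate encoders. The following is a minimal numpy sketch of that idea; all dimensions, variable names, and the linear-plus-tanh layer are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical dimensions (not taken from the paper).
LANG_DIM, IN_DIM, OUT_DIM = 4, 8, 8

rng = np.random.default_rng(0)

# Parameter generator: a linear map from a language vector to the
# flattened weight matrix of one encoder layer.
G_W = rng.standard_normal((LANG_DIM, IN_DIM * OUT_DIM)) * 0.1
G_b = np.zeros(IN_DIM * OUT_DIM)

def generate_layer_params(lang_vec):
    """Produce one encoder layer's weight matrix from a language vector."""
    flat = lang_vec @ G_W + G_b
    return flat.reshape(IN_DIM, OUT_DIM)

def encoder_layer(x, lang_vec):
    """Apply the language-conditioned layer (linear + tanh, for illustration)."""
    W = generate_layer_params(lang_vec)
    return np.tanh(x @ W)

# One-hot language vectors select different generated weights,
# so the same input is encoded differently per language.
tibetan = np.eye(LANG_DIM)[0]
mandarin = np.eye(LANG_DIM)[1]
x = rng.standard_normal(IN_DIM)
y_tib = encoder_layer(x, tibetan)
y_man = encoder_layer(x, mandarin)
```

In the full model one such generator would exist per encoder layer, and the generator's own weights (here `G_W`, `G_b`) are what training updates, so knowledge is shared across languages through the generator while the emitted layer weights remain language-specific.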