多对多语音转换中的口音和说话人解纠缠

2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date : 2020-11-17 DOI:10.1109/ISCSLP49672.2021.9362120

Zhichao Wang, Wenshuo Ge, Xiong Wang, Shan Yang, Wendong Gan, Haitao Chen, Hai Li, Lei Xie, Xiulin Li

{"title":"多对多语音转换中的口音和说话人解纠缠","authors":"Zhichao Wang, Wenshuo Ge, Xiong Wang, Shan Yang, Wendong Gan, Haitao Chen, Hai Li, Lei Xie, Xiulin Li","doi":"10.1109/ISCSLP49672.2021.9362120","DOIUrl":null,"url":null,"abstract":"This paper proposes an interesting voice and accent joint conversion approach, which can convert an arbitrary source speaker’s voice to a target speaker with non-native accent. This problem is challenging as each target speaker only has training data in native accent and we need to disentangle accent and speaker information in the conversion model training and re-combine them in the conversion stage. In our recognition-synthesis conversion framework, we manage to solve this problem by two proposed tricks. First, we use accent-dependent speech recognizers to obtain bottleneck features for different accented speakers. This aims to wipe out other factors beyond the linguistic information in the BN features for conversion model training. Second, we propose to use adversarial training to better disentangle the speaker and accent information in our encoder-decoder based conversion model. Specifically, we plug an auxiliary speaker classifier to the encoder, trained with an adversarial loss to wipe out speaker information from the encoder output. Experiments show that our approach is superior to the baseline. The proposed tricks are quite effective in improving accentedness and audio quality and speaker similarity are well maintained.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":"{\"title\":\"Accent and Speaker Disentanglement in Many-to-many Voice Conversion\",\"authors\":\"Zhichao Wang, Wenshuo Ge, Xiong Wang, Shan Yang, Wendong Gan, Haitao Chen, Hai Li, Lei Xie, Xiulin Li\",\"doi\":\"10.1109/ISCSLP49672.2021.9362120\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes an interesting voice and accent joint conversion approach, which can convert an arbitrary source speaker’s voice to a target speaker with non-native accent. This problem is challenging as each target speaker only has training data in native accent and we need to disentangle accent and speaker information in the conversion model training and re-combine them in the conversion stage. In our recognition-synthesis conversion framework, we manage to solve this problem by two proposed tricks. First, we use accent-dependent speech recognizers to obtain bottleneck features for different accented speakers. This aims to wipe out other factors beyond the linguistic information in the BN features for conversion model training. Second, we propose to use adversarial training to better disentangle the speaker and accent information in our encoder-decoder based conversion model. Specifically, we plug an auxiliary speaker classifier to the encoder, trained with an adversarial loss to wipe out speaker information from the encoder output. Experiments show that our approach is superior to the baseline. The proposed tricks are quite effective in improving accentedness and audio quality and speaker similarity are well maintained.\",\"PeriodicalId\":279828,\"journal\":{\"name\":\"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"22\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISCSLP49672.2021.9362120\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISCSLP49672.2021.9362120","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 22

摘要

本文提出了一种有趣的语音和口音联合转换方法，该方法可以将任意源说话者的语音转换为具有非母语口音的目标说话者。这个问题很有挑战性，因为每个目标说话者只有母语口音的训练数据，我们需要在转换模型训练中将口音和说话者信息分离，并在转换阶段将它们重新组合。在我们的识别-合成转换框架中，我们通过提出两个技巧来解决这个问题。首先，我们使用依赖于口音的语音识别器来获取不同口音说话者的瓶颈特征。其目的是消除BN特征中语言信息以外的其他因素，用于转换模型的训练。其次，我们建议在基于编码器-解码器的转换模型中使用对抗性训练来更好地分离说话者和口音信息。具体来说，我们将一个辅助的说话人分类器插入编码器，用对抗损失训练从编码器输出中消除说话人信息。实验表明，该方法优于基线方法。所提出的技巧在改善口音、音质和说话者相似度方面非常有效。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Accent and Speaker Disentanglement in Many-to-many Voice Conversion

This paper proposes an interesting voice and accent joint conversion approach, which can convert an arbitrary source speaker’s voice to a target speaker with non-native accent. This problem is challenging as each target speaker only has training data in native accent and we need to disentangle accent and speaker information in the conversion model training and re-combine them in the conversion stage. In our recognition-synthesis conversion framework, we manage to solve this problem by two proposed tricks. First, we use accent-dependent speech recognizers to obtain bottleneck features for different accented speakers. This aims to wipe out other factors beyond the linguistic information in the BN features for conversion model training. Second, we propose to use adversarial training to better disentangle the speaker and accent information in our encoder-decoder based conversion model. Specifically, we plug an auxiliary speaker classifier to the encoder, trained with an adversarial loss to wipe out speaker information from the encoder output. Experiments show that our approach is superior to the baseline. The proposed tricks are quite effective in improving accentedness and audio quality and speaker similarity are well maintained.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)

自引率

0.00%

发文量