基于语音后图的跨语言语音转换的领域适应和语言条件调节

Pin-Chieh Hsu, N. Minematsu, D. Saito
{"title":"基于语音后图的跨语言语音转换的领域适应和语言条件调节","authors":"Pin-Chieh Hsu, N. Minematsu, D. Saito","doi":"10.23919/APSIPAASC55919.2022.9979918","DOIUrl":null,"url":null,"abstract":"In this work, we examine two methods for im-proving phonetic posteriorgram (PPG) based cross-lingual voice conversion (CLV C). Previous research usually utilized a speaker encoder to characterize speakers' identity; however, the speaker embedding learned by the previous model tends to be language- dependent, degrading the performance of converted speeches. Therefore, we propose using the technique of domain-adversarial training. With this approach, the speaker embedding in different languages can be adapted into the same distribution to form a language-independent speaker embedding space. The other approach we propose is to employ external language conditioning to support our model to disentangle the language information from the speaker embedding. In our experiments, both methods are evaluated on a Japanese-English bilingual database. Besides subjective evaluation, two automatic objective assessment systems are adopted to assess the quality and speaker similarity of converted utterances. According to the experimental results, the two proposed methods can generate speaker embedding with reduced language dependency and improve the naturalness and speaker similarity of converted speeches.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Domain Adaptation and Language Conditioning to Improve Phonetic Posteriorgram Based Cross-Lingual Voice Conversion\",\"authors\":\"Pin-Chieh Hsu, N. Minematsu, D. Saito\",\"doi\":\"10.23919/APSIPAASC55919.2022.9979918\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we examine two methods for im-proving phonetic posteriorgram (PPG) based cross-lingual voice conversion (CLV C). Previous research usually utilized a speaker encoder to characterize speakers' identity; however, the speaker embedding learned by the previous model tends to be language- dependent, degrading the performance of converted speeches. Therefore, we propose using the technique of domain-adversarial training. With this approach, the speaker embedding in different languages can be adapted into the same distribution to form a language-independent speaker embedding space. The other approach we propose is to employ external language conditioning to support our model to disentangle the language information from the speaker embedding. In our experiments, both methods are evaluated on a Japanese-English bilingual database. Besides subjective evaluation, two automatic objective assessment systems are adopted to assess the quality and speaker similarity of converted utterances. According to the experimental results, the two proposed methods can generate speaker embedding with reduced language dependency and improve the naturalness and speaker similarity of converted speeches.\",\"PeriodicalId\":382967,\"journal\":{\"name\":\"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"volume\":\"103 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/APSIPAASC55919.2022.9979918\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/APSIPAASC55919.2022.9979918","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

在这项工作中,我们研究了两种改进基于语音后图(PPG)的跨语言语音转换(CLV C)的方法。先前的研究通常使用说话人编码器来表征说话人的身份;然而,先前的模型学习到的说话人嵌入倾向于语言依赖,降低了转换后的演讲的性能。因此,我们建议使用领域对抗训练技术。利用该方法,可以将不同语言的说话人嵌入到相同的分布中,形成与语言无关的说话人嵌入空间。我们提出的另一种方法是使用外部语言条件反射来支持我们的模型,以从说话人嵌入中分离语言信息。在我们的实验中,两种方法都在一个日英双语数据库上进行了评估。在主观评价的基础上,采用两套自动客观评价系统对转换后的话语质量和说话人相似度进行评价。实验结果表明,所提出的两种方法均能生成语言依赖性较低的说话人嵌入,提高转换后的语音的自然度和说话人相似度。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Domain Adaptation and Language Conditioning to Improve Phonetic Posteriorgram Based Cross-Lingual Voice Conversion
In this work, we examine two methods for im-proving phonetic posteriorgram (PPG) based cross-lingual voice conversion (CLV C). Previous research usually utilized a speaker encoder to characterize speakers' identity; however, the speaker embedding learned by the previous model tends to be language- dependent, degrading the performance of converted speeches. Therefore, we propose using the technique of domain-adversarial training. With this approach, the speaker embedding in different languages can be adapted into the same distribution to form a language-independent speaker embedding space. The other approach we propose is to employ external language conditioning to support our model to disentangle the language information from the speaker embedding. In our experiments, both methods are evaluated on a Japanese-English bilingual database. Besides subjective evaluation, two automatic objective assessment systems are adopted to assess the quality and speaker similarity of converted utterances. According to the experimental results, the two proposed methods can generate speaker embedding with reduced language dependency and improve the naturalness and speaker similarity of converted speeches.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信