MoCoVC: Non-parallel Voice Conversion with Momentum Contrastive Representation Learning

Kotaro Onishi, Toru Nakashika
{"title":"基于动量对比表征学习的非平行语音转换","authors":"Kotaro Onishi, Toru Nakashika","doi":"10.23919/APSIPAASC55919.2022.9979937","DOIUrl":null,"url":null,"abstract":"Non-parallel voice conversion with deep neural net-works often disentangle speaker individuality and speech content. However, these methods rely on external models, text data, or implicit constraints for ways to disentangle. They may require learning other models or annotating text, or may not understand how latent representations are acquired. Therefore, we pro-pose voice conversion with momentum contrastive representation learning (MoCo V C), a method of explicitly adding constraints to intermediate features using contrastive representation learning, which is a self-supervised learning method. Using contrastive rep-resentation learning with transformations that preserve utterance content allows us to explicitly constrain the intermediate features to preserve utterance content. We present transformations used for contrastive representation learning that could be used for voice conversion and verify the effectiveness of each in an exper-iment. Moreover, MoCoVC demonstrates a high or comparable performance to the vector quantization constrained method in terms of both naturalness and speaker individuality in subjective evaluation experiments.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"124 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MoCoVC: Non-parallel Voice Conversion with Momentum Contrastive Representation Learning\",\"authors\":\"Kotaro Onishi, Toru Nakashika\",\"doi\":\"10.23919/APSIPAASC55919.2022.9979937\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Non-parallel voice conversion with deep neural net-works often disentangle speaker individuality and speech content. However, these methods rely on external models, text data, or implicit constraints for ways to disentangle. They may require learning other models or annotating text, or may not understand how latent representations are acquired. Therefore, we pro-pose voice conversion with momentum contrastive representation learning (MoCo V C), a method of explicitly adding constraints to intermediate features using contrastive representation learning, which is a self-supervised learning method. Using contrastive rep-resentation learning with transformations that preserve utterance content allows us to explicitly constrain the intermediate features to preserve utterance content. We present transformations used for contrastive representation learning that could be used for voice conversion and verify the effectiveness of each in an exper-iment. 
Moreover, MoCoVC demonstrates a high or comparable performance to the vector quantization constrained method in terms of both naturalness and speaker individuality in subjective evaluation experiments.\",\"PeriodicalId\":382967,\"journal\":{\"name\":\"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"volume\":\"124 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/APSIPAASC55919.2022.9979937\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/APSIPAASC55919.2022.9979937","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Non-parallel voice conversion with deep neural networks often disentangles speaker individuality from speech content. However, existing methods rely on external models, text data, or implicit constraints to achieve this disentanglement: they may require training additional models or annotating text, or it may be unclear how the latent representations are acquired. We therefore propose voice conversion with momentum contrastive representation learning (MoCoVC), a method that explicitly constrains intermediate features using contrastive representation learning, a self-supervised learning approach. Applying contrastive representation learning with transformations that preserve utterance content allows us to explicitly constrain the intermediate features to retain that content. We present transformations for contrastive representation learning that are suitable for voice conversion and experimentally verify the effectiveness of each. In subjective evaluation experiments, MoCoVC achieves performance higher than or comparable to a vector-quantization-constrained method in terms of both naturalness and speaker individuality.
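
The abstract describes a MoCo-style objective: a momentum-updated key encoder, a queue of negative keys, and an InfoNCE loss computed between two content-preserving views of the same utterance's intermediate features. The sketch below is a minimal, hypothetical illustration of that objective in PyTorch, not the authors' implementation; the encoder, the 80-dimensional mel-spectrogram frames, the feature dimension, the queue size, and the noise-based stand-in for a content-preserving transform are all placeholder assumptions.

```python
# Minimal sketch of a MoCo-style contrastive objective on intermediate speech
# features. Encoder architecture, feature sizes, and the "content-preserving"
# transform are illustrative placeholders, not the paper's actual setup.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoCoFeatureLoss(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int = 128,
                 queue_size: int = 4096, momentum: float = 0.999,
                 temperature: float = 0.07):
        super().__init__()
        self.encoder_q = encoder                 # query encoder, trained by backprop
        self.encoder_k = copy.deepcopy(encoder)  # key encoder, momentum-updated only
        for p in self.encoder_k.parameters():
            p.requires_grad = False
        self.m = momentum
        self.t = temperature
        # Queue of past keys used as negatives (assumes batch size divides queue_size).
        self.register_buffer("queue", F.normalize(torch.randn(feat_dim, queue_size), dim=0))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data = self.m * pk.data + (1.0 - self.m) * pq.data

    @torch.no_grad()
    def _enqueue(self, keys: torch.Tensor):
        bsz = keys.shape[0]
        ptr = int(self.ptr)
        self.queue[:, ptr:ptr + bsz] = keys.T
        self.ptr[0] = (ptr + bsz) % self.queue.shape[1]

    def forward(self, x_q: torch.Tensor, x_k: torch.Tensor) -> torch.Tensor:
        """x_q, x_k: two content-preserving views of the same utterance."""
        q = F.normalize(self.encoder_q(x_q), dim=1)          # query features
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(x_k), dim=1)      # positive key features
        l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)                 # positive logits
        l_neg = torch.einsum("nc,ck->nk", q, self.queue.clone().detach())    # negative logits
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t
        labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
        self._enqueue(k)
        return F.cross_entropy(logits, labels)               # InfoNCE loss


# Toy usage with a placeholder frame-level encoder over 80-dim mel features.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 128))
moco = MoCoFeatureLoss(encoder, feat_dim=128)
mel = torch.randn(16, 80)                          # a batch of mel frames
mel_aug = mel + 0.01 * torch.randn_like(mel)       # stand-in content-preserving transform
loss = moco(mel, mel_aug)
loss.backward()
```

In this formulation the positive pair is the same utterance under two content-preserving views, so minimizing the InfoNCE loss pushes the intermediate features to encode utterance content while remaining invariant to the applied transformations; which transformations actually preserve content for voice conversion is exactly what the paper evaluates.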