{"title":"基于结构正则化层的自关注VAE增强零射多对多语音转换","authors":"Ziang Long, Yunling Zheng, Meng Yu, Jack Xin","doi":"10.1109/AI4I54798.2022.00022","DOIUrl":null,"url":null,"abstract":"Variational auto-encoder (VAE) is an effective neural network architecture to disentangle a speech utterance into speaker identity and linguistic content latent embeddings, then generate an utterance for a target speaker from that of a source speaker. This is possible by concatenating the identity embedding of the target speaker and the content embedding of the source speaker uttering a desired sentence. In this work, we propose to improve VAE models with self-attention and structural regularization (RGSM). Specifically, we found a suitable location of VAE’s decoder to add a self-attention layer for incorporating non-local information in generating a converted utterance and hiding the source speaker’s identity. We applied relaxed groupwise splitting method (RGSM) to regularize network weights and remarkably enhance generalization performance. In experiments of zero-shot many-to-many voice conversion task on VCTK data set, with the self-attention layer and relaxed group-wise splitting method, our model achieves a gain of speaker classification accuracy on unseen speakers by 28.3% while slightly improved conversion voice quality in terms of MOSNet scores. Our encouraging findings point to future research on integrating more variety of attention structures in VAE framework while controlling model size and overfitting for advancing zero-shot many-to-many voice conversions1.1The work was partially supported by NSF grants DMS-1854434 and DMS-1952644 at UC Irvine.","PeriodicalId":345427,"journal":{"name":"2022 5th International Conference on Artificial Intelligence for Industries (AI4I)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Enhancing Zero-Shot Many to Many Voice Conversion via Self-Attention VAE with Structurally Regularized Layers\",\"authors\":\"Ziang Long, Yunling Zheng, Meng Yu, Jack Xin\",\"doi\":\"10.1109/AI4I54798.2022.00022\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Variational auto-encoder (VAE) is an effective neural network architecture to disentangle a speech utterance into speaker identity and linguistic content latent embeddings, then generate an utterance for a target speaker from that of a source speaker. This is possible by concatenating the identity embedding of the target speaker and the content embedding of the source speaker uttering a desired sentence. In this work, we propose to improve VAE models with self-attention and structural regularization (RGSM). Specifically, we found a suitable location of VAE’s decoder to add a self-attention layer for incorporating non-local information in generating a converted utterance and hiding the source speaker’s identity. We applied relaxed groupwise splitting method (RGSM) to regularize network weights and remarkably enhance generalization performance. In experiments of zero-shot many-to-many voice conversion task on VCTK data set, with the self-attention layer and relaxed group-wise splitting method, our model achieves a gain of speaker classification accuracy on unseen speakers by 28.3% while slightly improved conversion voice quality in terms of MOSNet scores. 
Our encouraging findings point to future research on integrating more variety of attention structures in VAE framework while controlling model size and overfitting for advancing zero-shot many-to-many voice conversions1.1The work was partially supported by NSF grants DMS-1854434 and DMS-1952644 at UC Irvine.\",\"PeriodicalId\":345427,\"journal\":{\"name\":\"2022 5th International Conference on Artificial Intelligence for Industries (AI4I)\",\"volume\":\"128 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 5th International Conference on Artificial Intelligence for Industries (AI4I)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/AI4I54798.2022.00022\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 5th International Conference on Artificial Intelligence for Industries (AI4I)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AI4I54798.2022.00022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Enhancing Zero-Shot Many to Many Voice Conversion via Self-Attention VAE with Structurally Regularized Layers
Ziang Long, Yunling Zheng, Meng Yu, Jack Xin
2022 5th International Conference on Artificial Intelligence for Industries (AI4I), September 2022. DOI: 10.1109/AI4I54798.2022.00022
The variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker-identity and linguistic-content latent embeddings, and then generating an utterance for a target speaker from that of a source speaker. This is done by concatenating the identity embedding of the target speaker with the content embedding of the source speaker uttering the desired sentence. In this work, we propose to improve VAE models with self-attention and structural regularization. Specifically, we identify a suitable location in the VAE's decoder to add a self-attention layer, which incorporates non-local information when generating a converted utterance and helps hide the source speaker's identity. We apply the relaxed group-wise splitting method (RGSM) to regularize the network weights, which remarkably enhances generalization performance. In experiments on a zero-shot many-to-many voice conversion task with the VCTK data set, the self-attention layer and RGSM together improve speaker classification accuracy on unseen speakers by 28.3% while slightly improving converted voice quality in terms of MOSNet scores. Our encouraging findings point to future research on integrating a greater variety of attention structures into the VAE framework, while controlling model size and overfitting, to advance zero-shot many-to-many voice conversion. (This work was partially supported by NSF grants DMS-1854434 and DMS-1952644 at UC Irvine.)
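To make the conversion-by-concatenation idea concrete, the following Python (PyTorch) sketch is a minimal illustration, not the authors' released code: the target speaker's identity embedding is broadcast over time and concatenated with the source utterance's content embedding, and a self-attention layer inside the decoder mixes non-local frame information before mel frames are reconstructed. All module names, dimensions, and the exact attention placement are assumptions; the variational sampling and training losses are omitted, and group_soft_threshold is only a simplified group-L1 stand-in for the weight-regularization step, not the paper's exact RGSM procedure.

# Minimal illustrative sketch under the assumptions stated above.
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Single-head self-attention over the time axis of a (B, T, C) tensor."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=1,
                                          batch_first=True)

    def forward(self, x):                       # x: (B, T, C)
        out, _ = self.attn(x, x, x)             # non-local mixing across frames
        return x + out                          # residual connection


class ContentEncoder(nn.Module):
    """Encodes a mel-spectrogram into frame-wise content embeddings."""

    def __init__(self, n_mels=80, content_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, content_dim, batch_first=True)

    def forward(self, mel):                     # mel: (B, T, n_mels)
        content, _ = self.rnn(mel)
        return content                          # (B, T, content_dim)


class SpeakerEncoder(nn.Module):
    """Maps a reference utterance to a fixed-size speaker identity embedding."""

    def __init__(self, n_mels=80, spk_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, spk_dim, batch_first=True)

    def forward(self, mel):
        h, _ = self.rnn(mel)
        return h.mean(dim=1)                    # (B, spk_dim), temporal average


class Decoder(nn.Module):
    """Reconstructs mel frames from content + identity embeddings; the
    self-attention layer sits mid-decoder so generation can use non-local context."""

    def __init__(self, content_dim=64, spk_dim=256, n_mels=80, hidden=256):
        super().__init__()
        self.pre = nn.Linear(content_dim + spk_dim, hidden)
        self.attn = SelfAttention(hidden)       # the added non-local layer
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.post = nn.Linear(hidden, n_mels)

    def forward(self, content, spk_emb):
        # Broadcast the per-speaker embedding over every frame, then
        # concatenate it with the frame-wise content embedding.
        spk = spk_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        h = torch.relu(self.pre(torch.cat([content, spk], dim=-1)))
        h = self.attn(h)
        h, _ = self.rnn(h)
        return self.post(h)                     # (B, T, n_mels)


def group_soft_threshold(weight, lam):
    """Simplified stand-in for the group-sparsity step of RGSM: soft-threshold
    whole output groups (rows) of a weight matrix by their L2 norm. The actual
    relaxed group-wise splitting method alternates a proximal step of this kind
    with ordinary gradient updates on a split auxiliary variable."""
    flat = weight.view(weight.size(0), -1)
    scale = torch.clamp(1.0 - lam / (flat.norm(dim=1, keepdim=True) + 1e-12),
                        min=0.0)
    return (flat * scale).view_as(weight)


if __name__ == "__main__":
    content_enc, speaker_enc, decoder = ContentEncoder(), SpeakerEncoder(), Decoder()
    src_mel = torch.randn(1, 120, 80)           # source utterance (mel frames)
    tgt_ref = torch.randn(1, 90, 80)            # reference from an unseen target speaker
    converted = decoder(content_enc(src_mel), speaker_enc(tgt_ref))
    print(converted.shape)                      # torch.Size([1, 120, 80])

In this sketch, zero-shot conversion follows from the speaker encoder producing an identity embedding for any reference utterance, including speakers never seen in training; the decoder only ever consumes the concatenated pair, so swapping in a new speaker embedding requires no retraining.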