Hao Huang, Lin Wang, Jichen Yang, Ying Hu, Liang He
Title: W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision
Journal: EURASIP Journal on Audio, Speech, and Music Processing
Published: 2023-10-28
DOI: 10.1186/s13636-023-00312-8
Abstract
Non-parallel data voice conversion (VC) has achieved considerable breakthroughs in recent years owing to the use of self-supervised pre-trained representations (SSPR). Features extracted by a pre-trained model are expected to carry rich content information. However, common SSPR-based VC makes no special effort to remove speaker information when extracting the content representation, so speaker information cannot be further purged from the SSPR representation. Moreover, conventional VC usually reconstructs the Mel-spectrogram as the acoustic feature, which is inconsistent with the input of the content encoder and loses some information. Motivated by the above, we propose W2VC to address these issues. W2VC consists of three parts: (1) we reconstruct features from the WavLM representation (WLMR), which is more consistent with the input of the content encoder; (2) connectionist temporal classification (CTC) is used to align the content representation with the text at the phoneme level, and a speaker classifier with a gradient reversal layer (GRL) is attached to the content encoder to remove speaker information from the content representation; (3) a WLMR-based HiFi-GAN is trained to convert WLMR back to waveform speech. VC experiments show that the GRL purifies the content information of the self-supervised model well, and that GRL purification and CTC supervision on the content encoder are complementary in improving VC performance. Moreover, speech synthesized with the retrained WLMR vocoder achieves better results in both subjective and objective evaluation. The proposed method is evaluated on the VCTK and CMU databases: it achieves 8.901 in objective MCD and subjective MOS scores of 4.45 for speech naturalness and 3.62 for speaker similarity, outperforming the baseline.
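The gradient reversal layer in part (2) is the mechanism that makes the content encoder adversarial to the speaker classifier: it is the identity in the forward pass, but negates (and scales) gradients in the backward pass, so minimizing the speaker-classification loss pushes the encoder to discard speaker cues. A minimal framework-free sketch of that behavior (the paper presumably implements this inside an autograd framework; the class name and `lam` scale here are illustrative assumptions):

```python
class GradientReversal:
    """Identity on the forward pass; flips and scales gradients on backward.

    Placed between a content encoder and a speaker classifier, the reversed
    gradient makes the encoder *maximize* the speaker loss it receives,
    driving speaker information out of the content representation.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (hyperparameter)

    def forward(self, features):
        # Forward: pass the content features through unchanged.
        return features

    def backward(self, grad_from_classifier):
        # Backward: negate and scale the gradient flowing to the encoder.
        return [-self.lam * g for g in grad_from_classifier]


grl = GradientReversal(lam=0.5)
h = [0.2, -1.3, 0.7]          # toy content-encoder output
assert grl.forward(h) == h     # identity forward
assert grl.backward([1.0, -2.0]) == [-0.5, 1.0]  # reversed gradient
```

In a real training loop the encoder thus receives the negative of the speaker-classifier gradient, while the classifier itself still trains normally on the un-reversed loss.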
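The CTC supervision in part (2) aligns frame-level content representations with a phoneme sequence without frame-wise labels: CTC scores every frame-level label path that, after removing blanks and collapsing repeats, yields the target phonemes. A small sketch of that collapsing rule (blank id and phoneme ids here are arbitrary choices for illustration, not the paper's inventory):

```python
def ctc_collapse(path, blank=0):
    """Map a frame-level CTC label path to its output phoneme sequence:
    drop blank symbols and merge consecutive repeats."""
    out = []
    prev = None
    for label in path:
        if label != blank and label != prev:
            out.append(label)
        prev = label
    return out


# Two different frame-level paths that CTC treats as the same phonemes [3, 5]:
assert ctc_collapse([0, 3, 3, 0, 5, 5, 5, 0]) == [3, 5]
assert ctc_collapse([3, 0, 0, 5]) == [3, 5]
# A blank between repeats keeps them distinct phonemes:
assert ctc_collapse([3, 0, 3]) == [3, 3]
```

Training with a CTC loss over all such paths encourages the content encoder to emit representations that are linearly decodable into phonemes, which complements the GRL's removal of speaker information.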
About the journal:
The aim of the EURASIP Journal on Audio, Speech, and Music Processing is to bring together researchers, scientists, and engineers working on the theory and applications of audio signal processing, with a specific focus on speech and music. It is an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processing.