CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion

Impact factor 3.0 · Zone 3 (Computer Science) · JCR Q2 (Acoustics)

Speech Communication, Volume 165, Article 103139 · Pub Date: 2024-09-10 · DOI: 10.1016/j.specom.2024.103139

Yuhang Xue, Ning Chen, Yixin Luo, Hongqing Zhu, Zhiying Zhu


Citations: 0


CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion

One-shot voice conversion (VC) has attracted more and more attention due to its broad prospects for practical application. In this task, the representation ability of speech features and the model’s generalization are the focus of attention. This paper proposes a model called CLESSR-VC, which enhances pre-trained self-supervised learning (SSL) representations through contrastive learning for one-shot VC. First, SSL features from the 23rd and 9th layers of the pre-trained WavLM are adopted to extract content embedding and SSL speaker embedding, respectively, to ensure the model’s generalization. Then, the conventional acoustic feature mel-spectrograms and contrastive learning are introduced to enhance the representation ability of speech features. Specifically, contrastive learning combined with the pitch-shift augmentation method is applied to disentangle content information from SSL features accurately. Mel-spectrograms are adopted to extract mel speaker embedding. The AM-Softmax and cross-architecture contrastive learning are applied between SSL and mel speaker embeddings to obtain the fused speaker embedding that helps improve speech quality and speaker similarity. Both objective and subjective evaluation results on the VCTK corpus confirm that the proposed VC model has outstanding performance and few trainable parameters.
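The abstract names two loss families, AM-Softmax and contrastive learning, without giving their exact form. The sketch below shows the standard textbook versions of both (additive-margin softmax over cosine logits, and an InfoNCE-style contrastive loss) in plain Python. It is not the authors' implementation: the function names, the `scale`, `margin`, and `temperature` defaults, and the way positives and negatives are paired are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def am_softmax_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Additive-margin softmax loss for one speaker embedding.

    class_weights holds one weight vector per training speaker; the margin is
    subtracted from the target-class cosine before scaling. The scale and
    margin defaults are common choices, not values taken from the paper.
    """
    logits = []
    for index, weight in enumerate(class_weights):
        cos = cosine(embedding, weight)
        if index == target:
            cos -= margin  # penalise the true class to enforce a margin
        logits.append(scale * cos)
    return log_sum_exp(logits) - logits[target]


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor.

    The anchor is pulled towards the positive and pushed away from the
    negatives; similarity is cosine divided by a temperature (assumed value).
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    return log_sum_exp(logits) - logits[0]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for speaker embeddings (hypothetical data).
    ssl_speaker = [0.9, 0.1]       # e.g. an embedding derived from SSL features
    mel_same_speaker = [0.8, 0.2]  # e.g. a mel-based embedding, same speaker
    mel_other_speaker = [0.1, 0.9]
    print(info_nce_loss(ssl_speaker, mel_same_speaker, [mel_other_speaker]))
    print(am_softmax_loss(ssl_speaker, [[1.0, 0.0], [0.0, 1.0]], target=0))
```

One plausible reading of "cross-architecture contrastive learning" is that the SSL-based and mel-based embeddings of the same speaker form the positive pair, as in the toy usage above. For the content branch, the abstract suggests that an utterance and its pitch-shifted copy would play that role. Both pairings are interpretations of the abstract rather than details it confirms.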


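Source journal

Speech Communication (Engineering & Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 6.80
Self-citation rate: 6.20%
Articles published: not listed
Review time: 19.2 weeks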

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.