{"title":"Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary Speakers","authors":"Ying Zhang, Hao Che, Xiaorui Wang","doi":"10.1109/ISCSLP49672.2021.9362095","DOIUrl":null,"url":null,"abstract":"Voice conversion (VC) aims to modify the speaker’s tone while preserving the linguistic information. Recent works show that voice conversion has made great progress on non-parallel data by introducing phonetic posteriorgrams (PPGs). However, once the prosody of source and target speaker differ significantly, it causes noticeable quality degradation of the converted speech. To alleviate the impact of the prosody of the source speaker, we propose a sequence-to-sequence voice conversion (Seq2SeqVC) method, which utilizes connectionist temporal classification PPGs (CTC-PPGs) as inputs and models the non-linear length mapping between CTC-PPGs and frame-level acoustic features. CTC-PPGs are extracted by the CTC based automatic speech recognition (CTC-ASR) model and used to replace time-aligned PPGs. The blank token is introduced in CTC-ASR outputs to identify fewer information frames and get around consecutive repeating characters. After removing blank tokens, the left CTC-PPGs only contain linguistic information, and the phone duration information of the source speech is removed. Thus, phone durations of the converted speech are more faithful to the target speaker, which means higher similarity to the target and less interference from different source speakers. Experimental results show our Seq2Seq-VC model achieves higher scores in similarity and naturalness tests than the baseline method. What’s more, we expand our seq2seqVC approach to voice conversion towards arbitrary speakers with limited data. The experimental results demonstrate that our Seq2Seq-VC model can transfer to a new speaker using 100 utterances (about 5 minutes).","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISCSLP49672.2021.9362095","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
Voice conversion (VC) aims to modify a speaker's voice timbre while preserving the linguistic information. Recent work shows that voice conversion has made great progress on non-parallel data by introducing phonetic posteriorgrams (PPGs). However, when the prosody of the source and target speakers differs significantly, the converted speech suffers noticeable quality degradation. To alleviate the influence of the source speaker's prosody, we propose a sequence-to-sequence voice conversion (Seq2Seq-VC) method, which uses connectionist temporal classification PPGs (CTC-PPGs) as inputs and models the non-linear length mapping between CTC-PPGs and frame-level acoustic features. CTC-PPGs are extracted by a CTC-based automatic speech recognition (CTC-ASR) model and used to replace time-aligned PPGs. The blank token introduced in the CTC-ASR outputs identifies low-information frames and allows consecutive repeated characters to be collapsed. After the blank tokens are removed, the remaining CTC-PPGs contain only linguistic information, and the phone duration information of the source speech is discarded. Thus, the phone durations of the converted speech are more faithful to the target speaker, which means higher similarity to the target and less interference from different source speakers. Experimental results show that our Seq2Seq-VC model achieves higher scores in similarity and naturalness tests than the baseline method. Furthermore, we extend our Seq2Seq-VC approach to voice conversion for arbitrary speakers with limited data. The experimental results demonstrate that our Seq2Seq-VC model can be adapted to a new speaker using 100 utterances (about 5 minutes of speech).
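To make the CTC-PPG idea concrete, here is a minimal, illustrative sketch of how frame-level CTC posteriors could be reduced to a duration-free CTC-PPG sequence by dropping blank-dominated frames and collapsing consecutive repeated tokens. This is not the authors' released implementation; the argmax-based frame selection, the `blank_id` convention, and the function name are assumptions made for illustration only.

```python
import numpy as np

def extract_ctc_ppgs(frame_ppg: np.ndarray, blank_id: int = 0) -> np.ndarray:
    """Reduce a frame-level CTC posteriorgram (T x V) to a shorter CTC-PPG
    sequence: drop frames whose argmax is the blank token and keep only the
    first frame of each run of identical non-blank tokens.

    Illustrative sketch only; the paper does not specify this exact procedure.
    """
    token_ids = frame_ppg.argmax(axis=-1)      # greedy label per frame
    keep, prev = [], None
    for t, tok in enumerate(token_ids):
        if tok == blank_id:                    # blank marks low-information frames
            prev = tok
            continue
        if tok != prev:                        # collapse consecutive repeats
            keep.append(t)
        prev = tok
    # The surviving rows carry linguistic content but no source-speaker
    # duration information; a seq2seq model with attention then maps them
    # to frame-level acoustic features in the target speaker's voice.
    return frame_ppg[keep]

# Toy example: 6 frames over a vocabulary {blank, 'a', 'b'}
demo = np.array([
    [0.9, 0.05, 0.05],   # blank
    [0.1, 0.80, 0.10],   # 'a'
    [0.1, 0.70, 0.20],   # 'a' (repeat, collapsed)
    [0.8, 0.10, 0.10],   # blank
    [0.1, 0.10, 0.80],   # 'b'
    [0.1, 0.20, 0.70],   # 'b' (repeat, collapsed)
])
print(extract_ctc_ppgs(demo).shape)  # (2, 3): one row each for 'a' and 'b'
```

Because the output length now depends only on the linguistic content, the seq2seq decoder is free to assign phone durations that match the target speaker rather than copying the source speaker's timing.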