An evaluation of voice conversion with neural network spectral mapping models and WaveNet vocoder

Impact Factor: 3.2 · JCR Q1 (Computer Science)
Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, T. Toda
{"title":"An evaluation of voice conversion with neural network spectral mapping models and WaveNet vocoder","authors":"Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, T. Toda","doi":"10.1017/ATSIP.2020.24","DOIUrl":null,"url":null,"abstract":"This paper presents an evaluation of parallel voice conversion (VC) with neural network (NN)-based statistical models for spectral mapping and waveform generation. The NN-based architectures for spectral mapping include deep NN (DNN), deep mixture density network (DMDN), and recurrent NN (RNN) models. WaveNet (WN) vocoder is employed as a high-quality NN-based waveform generation. In VC, though, owing to the oversmoothed characteristics of estimated speech parameters, quality degradation still occurs. To address this problem, we utilize post-conversion for the converted features based on direct waveform modifferential and global variance postfilter. To preserve the consistency with the post-conversion, we further propose a spectrum differential loss for the spectral modeling. The experimental results demonstrate that: (1) the RNN-based spectral modeling achieves higher accuracy with a faster convergence rate and better generalization compared to the DNN-/DMDN-based models; (2) the RNN-based spectral modeling is also capable of producing less oversmoothed spectral trajectory; (3) the use of proposed spectrum differential loss improves the performance in the same-gender conversions; and (4) the proposed post-conversion on converted features for the WN vocoder in VC yields the best performance in both naturalness and speaker similarity compared to the conventional use of WN vocoder.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2020-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1017/ATSIP.2020.24","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"APSIPA Transactions on Signal and Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1017/ATSIP.2020.24","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
Citations: 1

Abstract

This paper presents an evaluation of parallel voice conversion (VC) with neural network (NN)-based statistical models for spectral mapping and waveform generation. The NN-based architectures for spectral mapping include deep NN (DNN), deep mixture density network (DMDN), and recurrent NN (RNN) models. The WaveNet (WN) vocoder is employed as a high-quality NN-based waveform generation method. In VC, however, quality degradation still occurs owing to the oversmoothed characteristics of the estimated speech parameters. To address this problem, we apply post-conversion to the converted features based on direct waveform modification and a global variance (GV) postfilter. To preserve consistency with the post-conversion, we further propose a spectrum differential loss for spectral modeling. The experimental results demonstrate that: (1) the RNN-based spectral modeling achieves higher accuracy, a faster convergence rate, and better generalization than the DNN-/DMDN-based models; (2) the RNN-based spectral modeling also produces less oversmoothed spectral trajectories; (3) the proposed spectrum differential loss improves performance in same-gender conversions; and (4) the proposed post-conversion of converted features for the WN vocoder in VC yields the best performance in both naturalness and speaker similarity compared to the conventional use of the WN vocoder.
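
The spectral mapping and post-conversion components summarized above can be illustrated with a short sketch. The following PyTorch snippet is a minimal, hypothetical illustration rather than the authors' implementation: it assumes parallel, time-aligned mel-cepstral sequences, a GRU-based mapper (`RNNDifferentialMapper`) with illustrative dimensions and hyperparameters, a spectrum differential loss rendered as an MSE on the target-minus-source differential, and a GV postfilter rendered as per-dimension variance rescaling.

```python
# Minimal sketch (assumptions, not the authors' implementation) of RNN-based
# spectral mapping with a spectrum differential loss and a GV postfilter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNDifferentialMapper(nn.Module):
    """Estimates the frame-wise spectral differential from source mel-cepstra."""
    def __init__(self, dim=50, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, src_mcep):        # src_mcep: (batch, frames, dim)
        h, _ = self.rnn(src_mcep)
        return self.out(h)              # estimated differential, same shape

def spectrum_differential_loss(est_diff, src_mcep, tgt_mcep):
    # Train against the oracle target-minus-source differential so the model
    # stays consistent with post-conversion that modifies the source waveform
    # according to the spectrum differential.
    return F.mse_loss(est_diff, tgt_mcep - src_mcep)

def gv_postfilter(converted, target_gv, alpha=1.0, eps=1e-8):
    # Rescale the per-dimension variance of the converted trajectory toward the
    # target speaker's global variance to counteract oversmoothing.
    mean = converted.mean(dim=1, keepdim=True)
    scale = (target_gv / (converted.var(dim=1, keepdim=True) + eps)) ** (0.5 * alpha)
    return mean + scale * (converted - mean)

# Illustrative training step on random stand-in data.
model = RNNDifferentialMapper()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
src_mcep = torch.randn(8, 200, 50)      # aligned source mel-cepstra
tgt_mcep = torch.randn(8, 200, 50)      # aligned target mel-cepstra

opt.zero_grad()
est_diff = model(src_mcep)
loss = spectrum_differential_loss(est_diff, src_mcep, tgt_mcep)
loss.backward()
opt.step()

# Conversion time: source features plus the estimated differential, optionally
# GV-postfiltered (the target GV here is a stand-in; in practice it would be
# precomputed from the target speaker's training data).
target_gv = tgt_mcep.var(dim=1, keepdim=True)
converted = gv_postfilter(src_mcep + est_diff.detach(), target_gv)
```

The converted (or postfiltered) features would then drive waveform generation, e.g. the WN vocoder or direct waveform modification of the source speech; the exact formulation and loss used in the paper may differ from this sketch.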
Source journal: APSIPA Transactions on Signal and Information Processing (Engineering, Electrical & Electronic)
CiteScore: 8.60
Self-citation rate: 6.20%
Articles published: 30
Review time: 40 weeks