An evaluation of voice conversion with neural network spectral mapping models and WaveNet vocoder

Impact factor: 3.2 | JCR: Q1 | Category: Computer Science
Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, T. Toda
Journal: APSIPA Transactions on Signal and Information Processing
Published: 2020-11-25
DOI: 10.1017/ATSIP.2020.24 (https://doi.org/10.1017/ATSIP.2020.24)
Citation count: 1

Abstract

This paper presents an evaluation of parallel voice conversion (VC) with neural network (NN)-based statistical models for spectral mapping and waveform generation. The NN-based architectures for spectral mapping include deep NN (DNN), deep mixture density network (DMDN), and recurrent NN (RNN) models. A WaveNet (WN) vocoder is employed as a high-quality NN-based waveform generation method. In VC, however, quality degradation still occurs owing to the oversmoothed characteristics of the estimated speech parameters. To address this problem, we apply post-conversion to the converted features, based on direct waveform modification with spectrum differential and a global variance postfilter. To preserve consistency with the post-conversion, we further propose a spectrum differential loss for the spectral modeling. The experimental results demonstrate that: (1) the RNN-based spectral modeling achieves higher accuracy, with a faster convergence rate and better generalization, than the DNN-/DMDN-based models; (2) the RNN-based spectral modeling is also capable of producing a less oversmoothed spectral trajectory; (3) the proposed spectrum differential loss improves performance in same-gender conversions; and (4) the proposed post-conversion of the converted features for the WN vocoder in VC yields the best performance in both naturalness and speaker similarity compared to the conventional use of the WN vocoder.
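The abstract names a global variance (GV) postfilter as one remedy for oversmoothing but does not give its form. The sketch below shows one common variance-scaling formulation, which is an assumption here and not necessarily the authors' exact method: each feature dimension of a converted utterance is rescaled around its mean so that its utterance-level variance matches a target value. The function name `gv_postfilter` and the `target_gv` input (for example, variances averaged over target-speaker training utterances) are hypothetical.

```python
import math
from statistics import fmean, pvariance


def gv_postfilter(converted, target_gv):
    """Rescale each feature dimension so its variance over time matches target_gv.

    converted: list of frames, each a list of D feature values.
    target_gv: list of D target variances (hypothetical input for this sketch).
    """
    num_dims = len(converted[0])
    # Transpose to one trajectory over time per feature dimension.
    trajectories = [[frame[d] for frame in converted] for d in range(num_dims)]
    filtered = []
    for d, traj in enumerate(trajectories):
        mean = fmean(traj)
        var = pvariance(traj, mu=mean)
        # Leave a constant trajectory untouched to avoid dividing by zero.
        scale = math.sqrt(target_gv[d] / var) if var > 0 else 1.0
        filtered.append([mean + scale * (x - mean) for x in traj])
    # Transpose back to a list of frames.
    return [list(frame) for frame in zip(*filtered)]
```

Calling `gv_postfilter(frames, target_gv)` on an oversmoothed trajectory widens its excursions around the mean while leaving the mean itself unchanged, which is the intended effect of restoring lost variance.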
Source journal
APSIPA Transactions on Signal and Information Processing (Engineering, Electrical & Electronic)
CiteScore: 8.60
Self-citation rate: 6.20%
Articles published: 30
Review time: 40 weeks