Patrick Lumban Tobing, Tomoki Hayashi, Yi-Chiao Wu, Kazuhiro Kobayashi, T. Toda
2018 IEEE Spoken Language Technology Workshop (SLT), published 2018-12-01. DOI: 10.1109/SLT.2018.8639608 (https://doi.org/10.1109/SLT.2018.8639608)
An Evaluation of Deep Spectral Mappings and WaveNet Vocoder for Voice Conversion
This paper presents an evaluation of deep spectral mapping and a WaveNet vocoder in voice conversion (VC). In our VC framework, the spectral features of an input speaker are converted into those of a target speaker using deep spectral mapping, and the converted waveform is then generated from the converted spectral features together with the excitation features using the WaveNet vocoder. We compare three deep spectral mapping networks: a deep single density network (DSDN), a deep mixture density network (DMDN), and a long short-term memory recurrent neural network with an autoregressive output layer (LSTM-AR). We also investigate several methods for reducing the mismatch between the spectral features used by the WaveNet vocoder in training and those used in conversion, including methods that alleviate the oversmoothing of the converted spectral features and a method that refines WaveNet using the converted spectral features. The experimental results demonstrate that the LSTM-AR tends to yield better spectral mapping accuracy than the other networks, and that the proposed WaveNet refinement method significantly improves the naturalness of the converted waveform.
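To make the difference between the single density and mixture density networks concrete, the following is a minimal sketch of the per-frame training loss such networks commonly minimize: the negative log-likelihood of a target spectral frame under a Gaussian mixture whose parameters the network outputs. The abstract does not state the parameterization, so the diagonal covariances, the function name `mdn_nll`, and the argument names are all assumptions made for illustration. Under these assumptions, a single density network corresponds to the one-component case.

```python
import numpy as np


def mdn_nll(target, log_weights, means, log_stds):
    """Negative log-likelihood of one frame under a diagonal Gaussian mixture.

    Hypothetical helper for illustration; not taken from the paper.

    target:      (D,)   target spectral feature vector
    log_weights: (M,)   unnormalized mixture logits
    means:       (M, D) component means
    log_stds:    (M, D) component log standard deviations
    """
    # Normalize the mixture weights in the log domain (log-softmax).
    log_w = log_weights - np.logaddexp.reduce(log_weights)
    # Log-density of each diagonal Gaussian component at the target.
    z = (target - means) / np.exp(log_stds)
    log_pdf = -0.5 * np.sum(z**2 + 2.0 * log_stds + np.log(2.0 * np.pi), axis=1)
    # Log-sum-exp over components gives the mixture log-likelihood.
    return -np.logaddexp.reduce(log_w + log_pdf)


# Single-component case (M = 1), the assumed analogue of a single density network.
frame = np.zeros(2)
single = mdn_nll(frame, np.zeros(1), np.zeros((1, 2)), np.zeros((1, 2)))

# Two identical, equally weighted components describe the same density.
mixture = mdn_nll(frame, np.zeros(2), np.zeros((2, 2)), np.zeros((2, 2)))
```

Working in the log domain keeps the loss numerically stable when component densities are very small, which is why the sketch uses `np.logaddexp.reduce` rather than summing exponentials directly.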