基于小波变换的不同时间尺度F0神经网络情绪语音转换

Speech Synthesis Workshop Pub Date : 2016-09-13 DOI:10.21437/SSW.2016-23

Zhaojie Luo, T. Takiguchi, Y. Ariki

{"title":"基于小波变换的不同时间尺度F0神经网络情绪语音转换","authors":"Zhaojie Luo, T. Takiguchi, Y. Ariki","doi":"10.21437/SSW.2016-23","DOIUrl":null,"url":null,"abstract":"An artiﬁcial neural network is one of the most important models for training features of voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as mel cepstral coefﬁcients (MCC) which represent the spectrum features. However, a simple representation for fundamental frequency (F0) is not enough for neural networks to deal with an emotional voice, because the time sequence of F0 for an emotional voice changes drastically. There-fore, in this paper, we propose an effective method that uses the continuous wavelet transform (CWT) to decompose F0 into different temporal scales that can be well trained by NNs for prosody modeling in emotional voice conversion. Meanwhile, the proposed method uses deep belief networks (DBNs) to pre-train the NNs that convert spectral features. By utilizing these approaches, the proposed method can change the spectrum and the prosody for an emotional voice at the same time, and was able to outperform other state-of-the-art methods for emotional voice conversion.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform\",\"authors\":\"Zhaojie Luo, T. Takiguchi, Y. Ariki\",\"doi\":\"10.21437/SSW.2016-23\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"An artiﬁcial neural network is one of the most important models for training features of voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as mel cepstral coefﬁcients (MCC) which represent the spectrum features. However, a simple representation for fundamental frequency (F0) is not enough for neural networks to deal with an emotional voice, because the time sequence of F0 for an emotional voice changes drastically. There-fore, in this paper, we propose an effective method that uses the continuous wavelet transform (CWT) to decompose F0 into different temporal scales that can be well trained by NNs for prosody modeling in emotional voice conversion. Meanwhile, the proposed method uses deep belief networks (DBNs) to pre-train the NNs that convert spectral features. By utilizing these approaches, the proposed method can change the spectrum and the prosody for an emotional voice at the same time, and was able to outperform other state-of-the-art methods for emotional voice conversion.\",\"PeriodicalId\":340820,\"journal\":{\"name\":\"Speech Synthesis Workshop\",\"volume\":\"4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-09-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Synthesis Workshop\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/SSW.2016-23\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Synthesis Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/SSW.2016-23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 13

摘要

人工神经网络是训练语音转换任务特征的重要模型之一。通常，神经网络(NNs)在处理非线性特征方面非常有效，例如表示频谱特征的梅尔倒谱系数(MCC)。然而，对于神经网络来说，一个简单的基频(F0)表示是不足以处理情感语音的，因为情感语音的F0时间序列变化很大。因此，本文提出了一种有效的方法，即利用连续小波变换(CWT)将F0分解为不同的时间尺度，这些时间尺度可以被神经网络很好地训练，用于情绪语音转换中的韵律建模。同时，该方法利用深度信念网络(dbn)对转换频谱特征的神经网络进行预训练。利用这些方法，所提出的方法可以同时改变情感语音的频谱和韵律，并且能够优于其他最先进的情感语音转换方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform

An artiﬁcial neural network is one of the most important models for training features of voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as mel cepstral coefﬁcients (MCC) which represent the spectrum features. However, a simple representation for fundamental frequency (F0) is not enough for neural networks to deal with an emotional voice, because the time sequence of F0 for an emotional voice changes drastically. There-fore, in this paper, we propose an effective method that uses the continuous wavelet transform (CWT) to decompose F0 into different temporal scales that can be well trained by NNs for prosody modeling in emotional voice conversion. Meanwhile, the proposed method uses deep belief networks (DBNs) to pre-train the NNs that convert spectral features. By utilizing these approaches, the proposed method can change the spectrum and the prosody for an emotional voice at the same time, and was able to outperform other state-of-the-art methods for emotional voice conversion.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Speech Synthesis Workshop

自引率

0.00%

发文量