Non-Parallel Voice Conversion Based on Free-Energy Minimization of Speaker-Conditional Restricted Boltzmann Machine
Authors: Takuya Kishida, Toru Nakashika
Venue: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Published: 2022-11-07
DOI: https://doi.org/10.23919/APSIPAASC55919.2022.9980151
We propose a non-parallel voice conversion method based on minimizing the free energy of a restricted Boltzmann machine (RBM). The method uses an RBM that learns the generative probability of acoustic features conditioned on a target speaker. To obtain converted features, it iteratively updates the input acoustic features until their free energy reaches a local minimum. Because the method is based on an RBM, only a few hyperparameters need to be set and the number of trainable parameters is very small, so training is stable. The step size of the update is determined in accordance with the Newton-Raphson method; in doing so, we found that the Hessian matrix of the free energy can be approximated by a diagonal matrix, which lets the update be performed efficiently with little computation. In objective evaluation experiments, the proposed method outperforms StarGAN-VC in mel-cepstral distortion. In subjective evaluation experiments, its performance is comparable to that of StarGAN-VC in similarity MOS.
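
The update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: a unit-variance Gaussian-Bernoulli RBM with free energy F(v) = 0.5*||v - b||^2 - sum_j softplus(c_j + (W^T v)_j), randomly drawn parameters W, b, c standing in for a trained RBM conditioned on the target speaker, and a small positive floor eps on the diagonal curvature to keep the step well defined. All function and variable names are hypothetical.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def sigmoid(x):
    # Stable logistic function expressed through tanh.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def free_energy(v, W, b, c):
    # Free energy of the assumed unit-variance Gaussian-Bernoulli RBM.
    return 0.5 * np.sum((v - b) ** 2) - np.sum(softplus(c + W.T @ v))

def gradient(v, W, b, c):
    # dF/dv = (v - b) - W @ sigmoid(c + W^T v)
    return (v - b) - W @ sigmoid(c + W.T @ v)

def newton_diag_step(v, W, b, c, eps=1e-3):
    # The full Hessian is I - W diag(s * (1 - s)) W^T; only its diagonal is used.
    s = sigmoid(c + W.T @ v)
    hess_diag = 1.0 - (W ** 2) @ (s * (1.0 - s))
    # Assumption: floor the curvature so the step stays a descent direction.
    hess_diag = np.maximum(hess_diag, eps)
    return v - gradient(v, W, b, c) / hess_diag

def convert(v0, W, b, c, n_iter=50, tol=1e-8):
    # Iterate the diagonal Newton update until the features stop changing.
    v = v0.copy()
    for _ in range(n_iter):
        v_new = newton_diag_step(v, W, b, c)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical stand-ins for trained target-speaker parameters.
rng = np.random.default_rng(0)
D, H = 8, 16                      # feature and hidden dimensions (illustrative)
W = 0.1 * rng.standard_normal((D, H))
b = rng.standard_normal(D)
c = rng.standard_normal(H)

v_src = rng.standard_normal(D)    # one frame of "source" acoustic features
v_conv = convert(v_src, W, b, c)
```

Each step costs about as much as one hidden-unit activation pass plus an elementwise division, since no matrix inverse is formed. With the small weights chosen here the off-diagonal Hessian terms are minor, so the diagonal step behaves much like a full Newton step; whether that holds for a trained model is something the paper reports and this sketch does not verify.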