T. Okamoto, Kentaro Tachibana, T. Toda, Y. Shiga, H. Kawai
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5654-5658. Published 2018-04-24. DOI: https://doi.org/10.1109/ICASSP.2018.8462237
An Investigation of Subband Wavenet Vocoder Covering Entire Audible Frequency Range with Limited Acoustic Features
Although a WaveNet vocoder operating at sampling frequencies of 16 and 24 kHz can synthesize more natural-sounding speech waveforms than conventional vocoders, it is difficult to extend the sampling frequency directly to 48 kHz, which would cover the entire human audible frequency range for higher-quality synthesis, because the model becomes too large to train on a consumer GPU. To realize a 48 kHz WaveNet vocoder that can be trained on a consumer GPU, this paper applies a subband WaveNet architecture to a speaker-dependent WaveNet vocoder and proposes a subband WaveNet vocoder. In experiments, each conditional subband WaveNet, operating at a sampling frequency of 8 kHz, was trained successfully on a consumer GPU. Subjective evaluations with a Japanese male speech corpus indicate that the proposed subband WaveNet vocoder with 36-dimensional simple acoustic features significantly outperformed conventional source-filter model-based vocoders, including STRAIGHT with 86-dimensional features.
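The sampling-rate arithmetic behind a subband split can be sketched as follows. This is only an illustration under a stated assumption: a uniform, critically decimated filterbank, in which each 8 kHz subband stream carries 4 kHz of bandwidth. The abstract does not give the number of subbands or the filterbank design, so the band count printed here is a consequence of that assumption, not a figure from the paper, and `subband_layout` is a hypothetical helper.

```python
def subband_layout(full_rate_hz, subband_rate_hz):
    """Return (low_hz, high_hz) band edges for a uniform split.

    Assumption (not from the paper): the filterbank is critically
    decimated, so the number of bands equals the decimation factor.
    """
    if full_rate_hz % subband_rate_hz != 0:
        raise ValueError("full rate must be a multiple of the subband rate")
    num_bands = full_rate_hz // subband_rate_hz
    # A stream sampled at subband_rate_hz can represent half that bandwidth.
    band_width_hz = subband_rate_hz / 2
    return [(k * band_width_hz, (k + 1) * band_width_hz)
            for k in range(num_bands)]


bands = subband_layout(48000, 8000)
for index, (low_hz, high_hz) in enumerate(bands):
    print(f"band {index}: {low_hz:.0f}-{high_hz:.0f} Hz")
```

Under this assumption the bands together span 0 to 24 kHz, the Nyquist range of a 48 kHz signal, while each per-band model generates only 8000 samples per second of audio instead of 48000.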