{"title":"使用噪声生成模型的基于dnn的语音合成方法","authors":"Masakazu Une, Yuki Saito, Shinnosuke Takamichi, Daichi Kitamura, Ryoichi Miyazaki, H. Saruwatari","doi":"10.23919/APSIPA.2018.8659691","DOIUrl":null,"url":null,"abstract":"This paper proposes a generative approach to construct high-quality speech synthesis from noisy speech. Studio-quality recorded speech is required to construct high-quality speech synthesis, but most of existing speech has been recorded in a noisy environment. A common method to use noisy speech for training speech synthesis models is reducing the noise before the vocoder-based parameterization. However, such multi-step processes cause an accumulation of spectral distortion. Meanwhile, statistical parametric speech synthesis (SPSS) without vocoders, which directly generates spectral parameters or waveforms, has been proposed recently. The vocoder-free SPSS will enable us to train speech synthesis models considering the noise addition process generally used in signal processing research. In the proposed approach, newly introduced noise generation models trained by a generative adversarial training algorithm randomly generates spectra of the noise. The speech synthesis models are trained to make the sum of their output and the randomly generated noise close to spectra of noisy speech. Because the noise generation model parameters fit the spectrum of the observed noise, the proposed method can alleviate the spectral distortion found in the conventional method. Experimental results demonstrate that the proposed method outperforms the conventional method in terms of synthetic speech quality.","PeriodicalId":287799,"journal":{"name":"2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Generative approach using the noise generation models for DNN-based speech synthesis trained from noisy speech\",\"authors\":\"Masakazu Une, Yuki Saito, Shinnosuke Takamichi, Daichi Kitamura, Ryoichi Miyazaki, H. Saruwatari\",\"doi\":\"10.23919/APSIPA.2018.8659691\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes a generative approach to construct high-quality speech synthesis from noisy speech. Studio-quality recorded speech is required to construct high-quality speech synthesis, but most of existing speech has been recorded in a noisy environment. A common method to use noisy speech for training speech synthesis models is reducing the noise before the vocoder-based parameterization. However, such multi-step processes cause an accumulation of spectral distortion. Meanwhile, statistical parametric speech synthesis (SPSS) without vocoders, which directly generates spectral parameters or waveforms, has been proposed recently. The vocoder-free SPSS will enable us to train speech synthesis models considering the noise addition process generally used in signal processing research. In the proposed approach, newly introduced noise generation models trained by a generative adversarial training algorithm randomly generates spectra of the noise. The speech synthesis models are trained to make the sum of their output and the randomly generated noise close to spectra of noisy speech. 
Because the noise generation model parameters fit the spectrum of the observed noise, the proposed method can alleviate the spectral distortion found in the conventional method. Experimental results demonstrate that the proposed method outperforms the conventional method in terms of synthetic speech quality.\",\"PeriodicalId\":287799,\"journal\":{\"name\":\"2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"volume\":\"130 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/APSIPA.2018.8659691\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/APSIPA.2018.8659691","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
This paper proposes a generative approach to constructing high-quality speech synthesis from noisy speech. Studio-quality recordings are required to build high-quality speech synthesis, but most existing speech has been recorded in noisy environments. A common way to use noisy speech for training speech synthesis models is to reduce the noise before vocoder-based parameterization; however, such a multi-step process accumulates spectral distortion. Meanwhile, vocoder-free statistical parametric speech synthesis (SPSS), which directly generates spectral parameters or waveforms, has recently been proposed. Vocoder-free SPSS makes it possible to train speech synthesis models while explicitly modeling the additive-noise process commonly assumed in signal processing research. In the proposed approach, newly introduced noise generation models, trained with a generative adversarial training algorithm, randomly generate noise spectra. The speech synthesis models are trained so that the sum of their output and the randomly generated noise is close to the spectra of the noisy speech. Because the noise generation models fit the spectrum of the observed noise, the proposed method alleviates the spectral distortion introduced by the conventional method. Experimental results demonstrate that the proposed method outperforms the conventional method in terms of synthetic speech quality.
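To make the training scheme concrete, below is a minimal PyTorch sketch of the idea as summarized in the abstract: a synthesis network and a GAN-trained noise generator are optimized jointly so that synthesized speech spectra plus generated noise spectra approximate the observed noisy-speech spectra. All module names, dimensions, loss weights, and the assumption that speech and noise add in the power-spectral domain are illustrative; this is a sketch of the technique, not the paper's implementation.

```python
# Minimal sketch of the training idea: synthesized spectrum + GAN-generated
# noise spectrum should match the noisy-speech spectrum. All dimensions and
# architectures below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 513   # spectral bins (assumption)
LING_DIM = 300   # linguistic feature dimension (assumption)
Z_DIM = 64       # latent dimension for the noise generator (assumption)

def mlp(d_in, d_out, hidden=256):
    """Small feed-forward network used for every module in this sketch."""
    return nn.Sequential(
        nn.Linear(d_in, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, d_out),
    )

synth = mlp(LING_DIM, FEAT_DIM)       # linguistic features -> clean speech spectra
noise_gen = mlp(Z_DIM, FEAT_DIM)      # random latent -> noise spectra
disc = nn.Sequential(mlp(FEAT_DIM, 1), nn.Sigmoid())  # real vs. generated noise

opt_g = torch.optim.Adam(
    list(synth.parameters()) + list(noise_gen.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(ling, noisy_spec, observed_noise_spec):
    """One update. ling: [B, LING_DIM]; spectra: [B, FEAT_DIM] (power domain assumed)."""
    B = ling.size(0)
    z = torch.randn(B, Z_DIM)

    # Discriminator step: distinguish observed noise spectra from generated ones,
    # so the noise generator is pushed to fit the observed noise distribution.
    opt_d.zero_grad()
    loss_d = (bce(disc(observed_noise_spec), torch.ones(B, 1)) +
              bce(disc(noise_gen(z).detach()), torch.zeros(B, 1)))
    loss_d.backward()
    opt_d.step()

    # Generator step: (synthesized spectrum + generated noise) should approximate
    # the noisy-speech spectrum, while the generated noise fools the discriminator.
    opt_g.zero_grad()
    gen_noise = noise_gen(z)
    recon = synth(ling) + gen_noise          # additive noisy-speech model
    loss_recon = F.mse_loss(recon, noisy_spec)
    loss_adv = bce(disc(gen_noise), torch.ones(B, 1))
    (loss_recon + loss_adv).backward()
    opt_g.step()
    return loss_d.item(), loss_recon.item(), loss_adv.item()
```

The design choice this sketch mirrors is that the noise generator adversarially learns the distribution of the observed noise rather than a single point estimate, so the synthesis model never has to absorb residual noise or enhancement artifacts; this is what lets the approach avoid the accumulated spectral distortion of the conventional denoise-then-parameterize pipeline.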