Harmonic envelope prediction for realistic speech synthesis using kernel interpolation
Authors: P.-A. Fournier, Jean-Jules Brault
Published in: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.
Publication date: 2005-12-27
DOI: 10.1109/IJCNN.2005.1556217 (https://doi.org/10.1109/IJCNN.2005.1556217)
Citations: 0
Abstract
Harmonic and noise diphone concatenation is a proven method for obtaining high-quality speech synthesis, but it cannot be used when the basis corpus does not contain all the diphones needed. We propose a method to complete an individual's corpus using examples from other corpora. Five vowels from different speakers are parametrised with a harmonic and noise model (HNM). We use multi-frame analysis (MFA) and smoothing kernels to estimate the harmonic power spectrum envelopes. Different kernels are compared for predicting the harmonic envelopes of vowels from training data. We use Euclidean distance to measure the similarity between the real envelopes and the predicted ones. Synthesis of the interpolated vowels is then performed using the learned optimal parameters. Our results show that Gaussian kernels can achieve a 1.8 dB (34.4%) reduction in harmonic distortion compared to the mean harmonic envelope estimator. As far as we know, there is no other literature on phoneme prediction for realistic speech synthesis.
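
The comparison described in the abstract, a kernel-weighted envelope predictor against the mean envelope estimator, scored by Euclidean distance, can be illustrated with a minimal sketch. The abstract does not specify the input features, the bandwidth, or the exact estimator, so everything below is an assumption: a generic Nadaraya-Watson kernel regression with a Gaussian kernel, a hypothetical one-dimensional descriptor per training example, and small synthetic envelopes (in dB) that are not data from the paper.

```python
import math

def gaussian_kernel(distance, bandwidth):
    """Unnormalised Gaussian weight for a given distance."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kernel_predict(query, features, envelopes, bandwidth):
    """Nadaraya-Watson estimate: kernel-weighted average of training envelopes."""
    weights = [gaussian_kernel(euclidean(query, f), bandwidth) for f in features]
    total = sum(weights)
    n_bins = len(envelopes[0])
    return [
        sum(w * env[k] for w, env in zip(weights, envelopes)) / total
        for k in range(n_bins)
    ]

def mean_predict(envelopes):
    """Baseline: the mean envelope over all training examples."""
    n = len(envelopes)
    return [sum(env[k] for env in envelopes) / n for k in range(len(envelopes[0]))]

# Synthetic toy data (NOT from the paper): a hypothetical 1-D descriptor per
# example and a two-harmonic envelope in dB that varies linearly with it.
features = [[0.0], [1.0], [2.0], [3.0]]
envelopes = [[2.0 * f[0], -6.0 + 2.0 * f[0]] for f in features]

# Held-out example generated by the same synthetic rule.
query = [0.2]
target = [2.0 * query[0], -6.0 + 2.0 * query[0]]

bandwidth = 0.5  # assumed value; in practice it would be tuned on training data
kernel_error = euclidean(kernel_predict(query, features, envelopes, bandwidth), target)
mean_error = euclidean(mean_predict(envelopes), target)

print("kernel estimator error:", kernel_error)
print("mean estimator error:  ", mean_error)
```

On this toy data the kernel estimator lands closer to the held-out envelope than the mean estimator because nearby training examples dominate the weighted average. This only illustrates the mechanism and says nothing about the size of the improvement reported above.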