Enhanced F0 generation for GPR-based speech synthesis considering syllable-based prosodic features
Decha Moungsri, Tomoki Koriyama, Takao Kobayashi
2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017-12-01
https://doi.org/10.1109/APSIPA.2017.8282285
Conventional frame-level Gaussian process regression (GPR)-based F0 generation can produce natural-sounding pitch contours. However, a frame-level model alone is insufficient to represent pitch patterns over longer units, especially syllable-level tone contours in tonal languages. This paper proposes a multi-level modeling technique for improving GPR-based F0 generation, in which a syllable-level model is considered alongside the frame-level model. In the syllable-level model, the discrete cosine transform (DCT) coefficients extracted from the log F0 contour of each syllable are used as the output variables of the Gaussian process. F0 contours are generated by jointly maximizing the predictive distributions of the frame-level and syllable-level models. Objective evaluation results show improved F0 generation when only a small amount of training data (around 30 minutes) is available.
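The two ideas in the abstract, describing a syllable's log F0 contour by a few DCT coefficients and generating a contour that jointly maximizes frame-level and syllable-level predictive distributions, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single syllable, independent Gaussian predictions with diagonal variances at both levels, and an orthonormal DCT-II. The function names (`dct_matrix`, `syllable_dct`, `generate_log_f0`) and the `weight` parameter are hypothetical and do not come from the paper.

```python
import numpy as np


def dct_matrix(num_frames, num_coeffs):
    """Orthonormal DCT-II basis of shape (num_coeffs, num_frames)."""
    n = np.arange(num_frames)
    k = np.arange(num_coeffs)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / num_frames)
    basis *= np.sqrt(2.0 / num_frames)
    basis[0] /= np.sqrt(2.0)  # rescale the constant (k = 0) row
    return basis


def syllable_dct(log_f0, num_coeffs):
    """First num_coeffs DCT coefficients of one syllable's log F0 contour."""
    log_f0 = np.asarray(log_f0, dtype=float)
    return dct_matrix(len(log_f0), num_coeffs) @ log_f0


def generate_log_f0(frame_mean, frame_var, syl_mean, syl_var, weight=1.0):
    """Contour x maximizing the sum of two Gaussian log densities.

    Assumed objective (illustrative, diagonal variances):
        log N(x; frame_mean, diag(frame_var))
        + weight * log N(D x; syl_mean, diag(syl_var))
    where D is the truncated DCT matrix. Setting the gradient to zero
    gives the linear system solved below.
    """
    frame_mean = np.asarray(frame_mean, dtype=float)
    frame_var = np.asarray(frame_var, dtype=float)
    syl_mean = np.asarray(syl_mean, dtype=float)
    syl_var = np.asarray(syl_var, dtype=float)

    D = dct_matrix(len(frame_mean), len(syl_mean))
    precision = np.diag(1.0 / frame_var) + weight * (D.T / syl_var) @ D
    rhs = frame_mean / frame_var + weight * D.T @ (syl_mean / syl_var)
    return np.linalg.solve(precision, rhs)


# Illustrative usage with made-up numbers (not data from the paper).
frame_prediction = np.log(np.linspace(180.0, 140.0, 10))  # falling contour
frame_variance = np.full(10, 0.01)
syllable_prediction = syllable_dct(frame_prediction, 3) + [0.0, 0.05, 0.0]
syllable_variance = np.full(3, 0.01)
contour = generate_log_f0(frame_prediction, frame_variance,
                          syllable_prediction, syllable_variance)
```

Because both terms are Gaussian, the joint maximum has a closed form in this simplified setting. With full GPR predictive covariances the same linear system would use full precision matrices in place of the diagonal ones assumed here.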