Exemplar-Based Speech Waveform Generation for Text-To-Speech

Cassia Valentini-Botinhao, O. Watts, Felipe Espic, Simon King

2018 IEEE Spoken Language Technology Workshop (SLT), December 2018. DOI: 10.1109/SLT.2018.8639679
This paper presents a hybrid text-to-speech framework that uses a waveform generation method based on exemplars of natural speech waveforms. These exemplars are selected at synthesis time given a sequence of acoustic features generated from text by a statistical parametric speech synthesis model. In order to match the expected degradation of these target synthesis features, the database of units is constructed such that the units' target representations are generated from the same parametric model. We evaluate two variants of this framework that differ in the size of the exemplar: a small-unit variant (where unit boundaries are determined by pitch mark locations) and a half-phone variant (where unit boundaries are determined by sub-phone state forced alignment). We found that for a larger dataset (around four hours of training data) the exemplar-based waveform generation variants were rated higher than the vocoder-based system.
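The abstract does not give implementation details, but the core idea it describes, selecting database units whose features best match a sequence of acoustic features predicted from text, can be illustrated with a simple unit-selection search. The sketch below is a minimal, assumption-laden example: it runs a Viterbi search over candidate units using a target cost (distance to the predicted features) and a join cost (mismatch between consecutive units). The function name `select_units`, the feature dimensions, the Euclidean costs, and the `join_weight` parameter are illustrative choices, not the paper's actual method.

```python
# Minimal sketch of exemplar (unit) selection for waveform generation.
# Assumes each unit is summarised by a fixed-length acoustic feature vector;
# the paper's database construction and waveform concatenation are not shown.
import numpy as np

def select_units(targets, unit_features, join_weight=0.5):
    """Viterbi search over candidate units.

    targets:       (T, D) acoustic features predicted from text.
    unit_features: (N, D) features of the database units.
    Returns a list of T unit indices, one per target frame.
    """
    T = targets.shape[0]
    N = unit_features.shape[0]

    # Target cost: distance between each predicted frame and each unit.
    target_cost = np.linalg.norm(
        targets[:, None, :] - unit_features[None, :, :], axis=-1)   # (T, N)

    # Join cost: mismatch between features of consecutive units.
    join_cost = np.linalg.norm(
        unit_features[:, None, :] - unit_features[None, :, :], axis=-1)  # (N, N)

    # Dynamic programming over the unit lattice.
    cost = target_cost[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + join_weight * join_cost   # (prev unit, current unit)
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(N)] + target_cost[t]

    # Backtrace the lowest-cost path.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    predicted = rng.normal(size=(20, 8))    # stand-in for features from the TTS model
    database = rng.normal(size=(100, 8))    # stand-in for per-unit database features
    print(select_units(predicted, database))
```

In this toy setup the predicted features and the database features live in the same space, which loosely mirrors the paper's point that the units' target representations are generated by the same parametric model as the synthesis-time targets, so the two can be compared directly.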