{"title":"Experiments on unsupervised statistical parametric speech synthesis","authors":"Jinfu Ni, Y. Shiga, H. Kawai, H. Kashioka","doi":"10.1109/ISCSLP.2012.6423518","DOIUrl":null,"url":null,"abstract":"In order to build web-based voicefonts, an unsupervised method is needed to automate the extraction of acoustic and linguistic properties of speech. This paper addresses the impact of automatic speech transcription on statistical parametric speech synthesis based on a single speaker's 100 hour speech corpus, focusing particularly on two factors of affecting speech quality: transcript accuracy and size of training dataset. Experimental results indicate that for an unsupervised method to achieve fair (MOS 3) voice quality, 1.5 hours of speech are necessary for phone accuracy over 80% and 3.5 hours necessary for phone accuracy down to 65%. Improvement in MOS quality turns out not to be significant when more than 4 hours of speech are used. The usage of automatic transcripts certainly leads to voice degradation. One of the mechanisms behind this is that transcript errors cause mismatches between speech segments and phone labels that significantly distort the structures of decision trees in resultant HMM-based voices.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 8th International Symposium on Chinese Spoken Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISCSLP.2012.6423518","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In order to build web-based voicefonts, an unsupervised method is needed to automate the extraction of acoustic and linguistic properties of speech. This paper addresses the impact of automatic speech transcription on statistical parametric speech synthesis, using a single speaker's 100-hour speech corpus and focusing on two factors affecting speech quality: transcript accuracy and the size of the training dataset. Experimental results indicate that for an unsupervised method to achieve fair (MOS 3) voice quality, 1.5 hours of speech are required when phone accuracy exceeds 80%, and 3.5 hours when phone accuracy drops to 65%. The improvement in MOS quality is not significant when more than 4 hours of speech are used. Using automatic transcripts inevitably degrades voice quality. One mechanism behind this is that transcription errors cause mismatches between speech segments and phone labels, which significantly distort the structures of the decision trees in the resulting HMM-based voices.
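
The phone accuracy figures above are conventionally computed by aligning the automatic transcript's phone sequence against a reference via Levenshtein alignment and scoring (N - S - D - I) / N, where N is the number of reference phones and S, D, I are substitutions, deletions, and insertions. The sketch below is a minimal illustration of that standard metric, not code from the paper; the phone labels in the usage example are hypothetical.

```python
def phone_accuracy(reference, hypothesis):
    """Phone accuracy = (N - S - D - I) / N, where N is the number of
    reference phones. S + D + I equals the Levenshtein edit distance
    between the two phone sequences."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = minimum edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # i deletions
    for j in range(m + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match or substitution
    return (n - dp[n][m]) / n if n else 0.0

# Hypothetical example: one substitution (t -> p) and one deletion (g)
# against 8 reference phones gives (8 - 2) / 8 = 75% phone accuracy.
ref = ["sil", "k", "a", "t", "sil", "d", "o", "g"]
hyp = ["sil", "k", "a", "p", "sil", "d", "o"]
print(f"phone accuracy: {phone_accuracy(ref, hyp):.2%}")  # 75.00%
```

Under this metric, every misrecognized phone in an automatic transcript becomes a mislabeled training segment, which is how transcription errors propagate into the decision-tree state clustering of the resulting HMM-based voice.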