Modeling vowel duration for Japanese text-to-speech synthesis
J. Venditti, J. van Santen
5th International Conference on Spoken Language Processing (ICSLP 1998), published 1998-11-30
DOI: 10.21437/ICSLP.1998-46 (https://doi.org/10.21437/ICSLP.1998-46)
Citation count: 11
Abstract
Accurate estimation of segmental durations is crucial for natural-sounding text-to-speech (TTS) synthesis. This paper presents a model of vowel duration used in the Bell Labs Japanese TTS system. We describe the constraints on vowel devoicing, and the effects of factors such as phone identity, surrounding phone identities, accentuation, syllabic structure, and phrasal position on the duration of both long and short vowels. A Sum-of-Products approach is used to model key interactions observed in the data, and to predict values of factor combinations not found in the speech database. We report root mean squared deviations between observed and predicted durations ranging from 8 to 15 ms, and an overall correlation of 0.89.
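
To make the Sum-of-Products idea concrete, here is a minimal Python sketch of the general form: a predicted duration is a sum of terms, and each term is a product of per-factor scales. The factor names, the choice of two terms, and every numeric value below are hypothetical placeholders, not the parameters or the term structure reported in the paper. The two helper functions show how the reported evaluation measures (root mean squared deviation and correlation) are conventionally computed.

```python
import math

# Hypothetical per-factor scales. All values are illustrative placeholders,
# not parameters from the paper.
INTRINSIC_MS = {"a": 80.0, "i": 60.0, "u": 55.0}   # vowel identity
ACCENT_SCALE = {"accented": 1.10, "unaccented": 1.00}
POSITION_SCALE = {"phrase_final": 1.30, "phrase_medial": 1.00}
NEXT_PHONE_MS = {"voiced": 8.0, "voiceless": 0.0}  # following phone


def predict_duration_ms(vowel, accent, position, next_phone):
    """Predict a duration as a sum of two product terms (assumed structure).

    Term 1: vowel identity x accent x phrasal position.
    Term 2: following-phone offset x phrasal position.
    """
    term1 = INTRINSIC_MS[vowel] * ACCENT_SCALE[accent] * POSITION_SCALE[position]
    term2 = NEXT_PHONE_MS[next_phone] * POSITION_SCALE[position]
    return term1 + term2


def rmse(observed, predicted):
    """Root mean squared deviation between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(observed, predicted):
    """Pearson correlation between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


if __name__ == "__main__":
    # A factor combination can be predicted even if it never occurred in the
    # training data, because each factor level has its own scale.
    print(predict_duration_ms("u", "accented", "phrase_final", "voiced"))
```

Because each factor level carries its own scale, a model of this shape can assign a duration to a combination of levels that was never observed together, which is the property the abstract highlights.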