基于HMM的语音合成中F0建模与生成研究进展

2012 IEEE 11th International Conference on Signal Processing Pub Date : 2012-10-01 DOI:10.1109/ICOSP.2012.6491559

Kai Yu

{"title":"基于HMM的语音合成中F0建模与生成研究进展","authors":"Kai Yu","doi":"10.1109/ICOSP.2012.6491559","DOIUrl":null,"url":null,"abstract":"Fundamental frequency, or F0, is a critical factor in synthesising speech which is both natural and expressive. In HMM based speech synthesis, the modelling and generation of F0 is one of the key difficult factors which differentiate synthesis from recognition. Firstly, this is because F0 values are normally considered as a discontinuous function of time, whose domain is partly continuous and partly discrete. This results in two issues to be addressed in F0 modelling and generation: voiced/unvoiced decision and F0 trajectory. Another important characteristics of F0 is that it is supra-segmental, which means F0 should be modelled beyond the traditional phoneme level. Thirdly, the purpose of F0 modelling is not only for general high quality synthetic speech, but also for effective delivery of expressiveness. This requires explicitly link F0 modelling to (para/non-) linguistic information so that the control of F0 is easy and feasible. This paper reviews the state-of-the-art frameworks to address these issues. Possible future research directions are also discussed.","PeriodicalId":143331,"journal":{"name":"2012 IEEE 11th International Conference on Signal Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Review of F0 modelling and generation in HMM based speech synthesis\",\"authors\":\"Kai Yu\",\"doi\":\"10.1109/ICOSP.2012.6491559\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fundamental frequency, or F0, is a critical factor in synthesising speech which is both natural and expressive. In HMM based speech synthesis, the modelling and generation of F0 is one of the key difficult factors which differentiate synthesis from recognition. Firstly, this is because F0 values are normally considered as a discontinuous function of time, whose domain is partly continuous and partly discrete. This results in two issues to be addressed in F0 modelling and generation: voiced/unvoiced decision and F0 trajectory. Another important characteristics of F0 is that it is supra-segmental, which means F0 should be modelled beyond the traditional phoneme level. Thirdly, the purpose of F0 modelling is not only for general high quality synthetic speech, but also for effective delivery of expressiveness. This requires explicitly link F0 modelling to (para/non-) linguistic information so that the control of F0 is easy and feasible. This paper reviews the state-of-the-art frameworks to address these issues. Possible future research directions are also discussed.\",\"PeriodicalId\":143331,\"journal\":{\"name\":\"2012 IEEE 11th International Conference on Signal Processing\",\"volume\":\"17 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE 11th International Conference on Signal Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICOSP.2012.6491559\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 11th International Conference on Signal Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICOSP.2012.6491559","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

基本频率，或F0，是合成语音的一个关键因素，既自然又富有表现力。在基于HMM的语音合成中，F0的建模和生成是区分合成与识别的关键难点之一。首先，这是因为F0值通常被认为是时间的不连续函数，其域部分连续，部分离散。这导致在F0建模和生成中需要解决两个问题:声音/非声音决策和F0轨迹。F0的另一个重要特征是它是超分段的，这意味着F0应该超越传统的音素水平进行建模。第三，F0建模的目的不仅是为了一般高质量的合成语音，而且是为了有效地传递表现力。这需要明确地将F0建模与(para/non-)语言信息联系起来，以便F0的控制简单可行。本文回顾了解决这些问题的最新框架。并对未来可能的研究方向进行了讨论。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Review of F0 modelling and generation in HMM based speech synthesis

Fundamental frequency, or F0, is a critical factor in synthesising speech which is both natural and expressive. In HMM based speech synthesis, the modelling and generation of F0 is one of the key difficult factors which differentiate synthesis from recognition. Firstly, this is because F0 values are normally considered as a discontinuous function of time, whose domain is partly continuous and partly discrete. This results in two issues to be addressed in F0 modelling and generation: voiced/unvoiced decision and F0 trajectory. Another important characteristics of F0 is that it is supra-segmental, which means F0 should be modelled beyond the traditional phoneme level. Thirdly, the purpose of F0 modelling is not only for general high quality synthetic speech, but also for effective delivery of expressiveness. This requires explicitly link F0 modelling to (para/non-) linguistic information so that the control of F0 is easy and feasible. This paper reviews the state-of-the-art frameworks to address these issues. Possible future research directions are also discussed.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 IEEE 11th International Conference on Signal Processing

自引率

0.00%

发文量