Syllable duration as a proxy to latent prosodic features

Speech Prosody 2022. Publication date: 2022-05-23. DOI: 10.21437/speechprosody.2022-45

Christina Tånnander, D. House, Jens Edlund



Abstract



Recent advances in deep learning have pushed text-to-speech synthesis (TTS) very close to human speech. In deep learning, latent features are features that are hidden from us, although we may still meaningfully observe their effects. Analogously, latent prosodic features are the exact features that constitute, for example, prominence: they are unknown to us, although we know (some of) the functions of prominence and (some of) its acoustic correlates. Deep-learned speech models capture prosody well but leave us with little control and few insights. Previously, we explored average syllable duration at the word level, a simple and accessible metric, as a proxy for prominence: in Swedish TTS, where verb particles and numerals tend to receive too little prominence, these words were nudged towards lengthening while the TTS models were otherwise allowed to operate freely. Listener panels overwhelmingly preferred the nudged versions to the unmodified TTS. In this paper, we analyze utterances from the modified TTS. The analysis shows that duration-nudging of relevant words changes the following features in an observable manner: duration is predictably lengthened, word-initial glottalization occurs, and the general intonation pattern changes. This supports the view that latent prosodic features can be reflected in deep-learned models and accessed by proxy.
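The abstract does not say how the duration metric was computed or how the nudging was applied, so the following is only a minimal sketch of the general idea: average syllable duration per word, with a lengthened target for selected word classes. The `Word` class, the tag names in `NUDGE_TAGS`, the factor `1.2`, and the example durations are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    pos: str            # word class label (hypothetical tag set)
    duration_s: float   # total word duration in seconds
    n_syllables: int    # number of syllables in the word


# Assumed labels for the word classes the abstract mentions
# (verb particles and numerals); the names are illustrative.
NUDGE_TAGS = {"PARTICLE", "NUMERAL"}

# Illustrative lengthening factor, not a value from the paper.
NUDGE_FACTOR = 1.2


def avg_syllable_duration(word: Word) -> float:
    """Average syllable duration at the word level, in seconds."""
    if word.n_syllables <= 0:
        raise ValueError("a word needs at least one syllable")
    return word.duration_s / word.n_syllables


def nudged_target(word: Word, factor: float = NUDGE_FACTOR) -> float:
    """Target average syllable duration: lengthened for selected
    word classes, unchanged for all other words."""
    base = avg_syllable_duration(word)
    return base * factor if word.pos in NUDGE_TAGS else base


if __name__ == "__main__":
    # Made-up durations for a short Swedish phrase.
    words = [
        Word("tycker", "VERB", 0.30, 2),
        Word("om", "PARTICLE", 0.12, 1),
        Word("tre", "NUMERAL", 0.18, 1),
    ]
    for w in words:
        print(w.text, round(avg_syllable_duration(w), 3),
              round(nudged_target(w), 3))
```

In this sketch only the flagged words receive a longer target; everything else keeps the duration the model would produce on its own, which mirrors the abstract's description of nudging specific words while letting the TTS model otherwise operate freely.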
