ProsodySpeech: Towards Advanced Prosody Model for Neural Text-to-Speech

Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yujia Xiao
{"title":"韵律语音:神经文本到语音的高级韵律模型","authors":"Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yujia Xiao","doi":"10.1109/ICASSP43922.2022.9746744","DOIUrl":null,"url":null,"abstract":"This paper proposes ProsodySpeech, a novel prosody model to enhance encoder-decoder neural Text-To-Speech (TTS), to generate high expressive and personalized speech even with very limited training data. First, we use a Prosody Extractor built from a large speech corpus with various speakers to generate a set of prosody exemplars from multiple reference speeches, in which Mutual Information based Style content separation (MIST) is adopted to alleviate \"content leakage\" problem. Second, we use a Prosody Distributor to make a soft selection of appropriate prosody exemplars in phone-level with the help of an attention mechanism. The resulting prosody feature is then aggregated into the output of text encoder, together with additional phone-level pitch feature to enrich the prosody. We apply this method into two tasks: highly expressive multi style/emotion TTS and few-shot personalized TTS. The experiments show the proposed model outperforms baseline FastSpeech 2 + GST with significant improvements in terms of similarity and style expression.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Prosodyspeech: Towards Advanced Prosody Model for Neural Text-to-Speech\",\"authors\":\"Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yujia Xiao\",\"doi\":\"10.1109/ICASSP43922.2022.9746744\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes ProsodySpeech, a novel prosody model to enhance encoder-decoder neural Text-To-Speech (TTS), to generate high expressive and personalized speech even with very limited training data. First, we use a Prosody Extractor built from a large speech corpus with various speakers to generate a set of prosody exemplars from multiple reference speeches, in which Mutual Information based Style content separation (MIST) is adopted to alleviate \\\"content leakage\\\" problem. Second, we use a Prosody Distributor to make a soft selection of appropriate prosody exemplars in phone-level with the help of an attention mechanism. The resulting prosody feature is then aggregated into the output of text encoder, together with additional phone-level pitch feature to enrich the prosody. We apply this method into two tasks: highly expressive multi style/emotion TTS and few-shot personalized TTS. 
The experiments show the proposed model outperforms baseline FastSpeech 2 + GST with significant improvements in terms of similarity and style expression.\",\"PeriodicalId\":272439,\"journal\":{\"name\":\"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)\",\"volume\":\"19 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICASSP43922.2022.9746744\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSP43922.2022.9746744","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

This paper proposes ProsodySpeech, a novel prosody model that enhances encoder-decoder neural Text-To-Speech (TTS) to generate highly expressive and personalized speech even with very limited training data. First, a Prosody Extractor, built from a large speech corpus covering various speakers, generates a set of prosody exemplars from multiple reference speeches; Mutual Information based Style content separation (MIST) is adopted to alleviate the "content leakage" problem. Second, a Prosody Distributor makes a soft selection of appropriate prosody exemplars at the phone level with the help of an attention mechanism. The resulting prosody feature is then aggregated into the output of the text encoder, together with an additional phone-level pitch feature, to enrich the prosody. The method is applied to two tasks: highly expressive multi-style/emotion TTS and few-shot personalized TTS. Experiments show the proposed model outperforms the baseline FastSpeech 2 + GST, with significant improvements in terms of similarity and style expression.
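The paper itself provides no code; the PyTorch sketch below is only an illustrative reading of the phone-level soft selection described in the abstract, in which phone-level text-encoder outputs attend over a set of prosody exemplars and the attention-weighted prosody vector is aggregated with the encoder output alongside a phone-level pitch feature. All class, argument, and tensor names here (ProsodyDistributorSketch, d_text, d_prosody, and so on) are hypothetical and are not taken from the paper.

# Illustrative sketch (not the authors' code): attention-based soft selection of
# prosody exemplars at the phone level, roughly as the abstract describes.
import torch
import torch.nn as nn

class ProsodyDistributorSketch(nn.Module):
    """Hypothetical Prosody Distributor: each phone-level text-encoder frame
    attends over a set of prosody exemplars and receives a weighted mixture."""

    def __init__(self, d_text: int, d_prosody: int, d_attn: int = 128):
        super().__init__()
        self.query_proj = nn.Linear(d_text, d_attn)     # phone-level queries
        self.key_proj = nn.Linear(d_prosody, d_attn)    # exemplar keys
        self.value_proj = nn.Linear(d_prosody, d_text)  # exemplar values mapped to text dim
        self.pitch_embed = nn.Linear(1, d_text)         # phone-level pitch feature

    def forward(self, text_enc, exemplars, pitch):
        # text_enc:  (B, T_phones, d_text)       phone-level text-encoder output
        # exemplars: (B, N_exemplars, d_prosody)  prosody exemplars from reference speech
        # pitch:     (B, T_phones, 1)             phone-level pitch values
        q = self.query_proj(text_enc)                                        # (B, T, d_attn)
        k = self.key_proj(exemplars)                                         # (B, N, d_attn)
        v = self.value_proj(exemplars)                                       # (B, N, d_text)
        scores = torch.matmul(q, k.transpose(1, 2)) / (q.size(-1) ** 0.5)    # (B, T, N)
        weights = torch.softmax(scores, dim=-1)        # soft selection over exemplars
        prosody = torch.matmul(weights, v)             # (B, T, d_text)
        # Aggregate the selected prosody and the pitch feature into the encoder output.
        return text_enc + prosody + self.pitch_embed(pitch)

# Toy usage with random tensors.
if __name__ == "__main__":
    dist = ProsodyDistributorSketch(d_text=256, d_prosody=128)
    out = dist(torch.randn(2, 40, 256), torch.randn(2, 10, 128), torch.randn(2, 40, 1))
    print(out.shape)  # torch.Size([2, 40, 256])

With soft attention weights, every exemplar contributes in proportion to its relevance to the current phone rather than a single exemplar being picked outright, which is one plausible reading of the "soft selection" the abstract mentions.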