CyFi-TTS: Cyclic Normalizing Flow with Fine-Grained Representation for End-to-End Text-to-Speech

In-Sun Hwang, Young-Sub Han, Byoung-Ki Jeon
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023-06-04
DOI: 10.1109/ICASSP49357.2023.10095323 (https://doi.org/10.1109/ICASSP49357.2023.10095323)
Citations: 0

Abstract

Advanced end-to-end text-to-speech (TTS) systems generate high-quality speech directly from text. These systems perform well on transcripts seen during training, but synthesizing speech from unseen transcripts remains challenging. The generated speech is often mispronounced because the one-to-many problem creates an information gap between text and speech. To address these problems, we propose a cyclic normalizing flow with fine-grained representation for end-to-end text-to-speech (CyFi-TTS), which generates natural-sounding speech by bridging the information gap. We leverage a temporal multi-resolution upsampler to progressively produce a fine-grained representation. Furthermore, we adopt a cyclic normalizing flow to produce an acoustic representation through cyclic representation learning. Experimental results show that CyFi-TTS generates speech with clearer pronunciation than recent TTS systems. CyFi-TTS achieves a mean opinion score of 4.02 and a character error rate of 1.99%.
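The character error rate quoted in the abstract is conventionally defined as the character-level edit distance between a reference transcript and a recognized transcript of the synthesized speech, divided by the length of the reference. The following is a minimal sketch of that standard definition. The abstract does not describe the authors' evaluation pipeline (speech recognizer, text normalization), so this is an assumption about the metric rather than their setup, and the function names `edit_distance` and `character_error_rate` are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the paper.
    print(character_error_rate("hello world", "hello word"))
```

Under this definition, a CER of 1.99% corresponds to roughly two character edits per hundred reference characters.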