CyFi-TTS: Cyclic Normalizing Flow with Fine-Grained Representation for End-to-End Text-to-Speech

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2023-06-04. DOI: 10.1109/ICASSP49357.2023.10095323

In-Sun Hwang, Young-Sub Han, Byoung-Ki Jeon



Abstract



Advanced end-to-end text-to-speech (TTS) systems generate high-quality speech directly from text. These systems perform well on data seen during training, but synthesizing speech from unseen transcripts remains challenging. The generated speech is often mispronounced because the one-to-many problem creates an information gap between text and speech. To address this, we propose CyFi-TTS, a cyclic normalizing flow with fine-grained representation for end-to-end text-to-speech, which generates natural-sounding speech by bridging the information gap. We use a temporal multi-resolution upsampler to progressively produce a fine-grained representation, and we adopt a cyclic normalizing flow to produce an acoustic representation through cyclic representation learning. Experimental results show that CyFi-TTS generates speech with clearer pronunciation than recent TTS systems, achieving a mean opinion score of 4.02 and a character error rate of 1.99%.
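For readers unfamiliar with the reported metric, the following is a minimal sketch of how a character error rate (CER) is commonly computed: the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The abstract does not say which speech recognizer or text normalization produced the 1.99% figure, so the function names and the example strings here are illustrative assumptions, not the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical reference text and recognizer output for a synthesized utterance.
    reference = "hello world"
    recognized = "hallo world"
    print(f"CER = {character_error_rate(reference, recognized):.2%}")
```

Under this definition, a CER of 1.99% corresponds to roughly two character errors per hundred reference characters.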
