韵律增强的普通话文本转语音系统

2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC) Pub Date : 2021-04-01 DOI:10.1109/CTISC52352.2021.00020

Fangfang Niu, Wushour Silamu

{"title":"韵律增强的普通话文本转语音系统","authors":"Fangfang Niu, Wushour Silamu","doi":"10.1109/CTISC52352.2021.00020","DOIUrl":null,"url":null,"abstract":"The end-to-end Text-to-Speech (TTS), which can generate speech directly from a given sequence of graphemes or phonemes, has shown superior performance over the conventional TTS. It has been able to generate high-quality speech, but it is still unable to control the local prosody such as word-level emphasis. Although the prominence of synthesized speech can be adjusted by explicit prosody tags, the acquisition of such tags is often time-consuming and laborious. This paper focuses on a deep neural prominence prediction module, using Continuous Wavelet Transform (CWT) to analyze the prosodic signal of input data, get the corresponding continuous prominence values of Chinese characters in the text to guide the training of a prominence prediction network, so that it can realize the mapping from the input text to the corresponding prominence value of each Chinese character in the text. The proposed method does not need to label the training data manually, so a fully automatic prosody control system is realized. Experiments show that the proposed system can generate more natural and expressive speech.","PeriodicalId":268378,"journal":{"name":"2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Prosody-Enhanced Mandarin Text-to-Speech System\",\"authors\":\"Fangfang Niu, Wushour Silamu\",\"doi\":\"10.1109/CTISC52352.2021.00020\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The end-to-end Text-to-Speech (TTS), which can generate speech directly from a given sequence of graphemes or phonemes, has shown superior performance over the conventional TTS. It has been able to generate high-quality speech, but it is still unable to control the local prosody such as word-level emphasis. Although the prominence of synthesized speech can be adjusted by explicit prosody tags, the acquisition of such tags is often time-consuming and laborious. This paper focuses on a deep neural prominence prediction module, using Continuous Wavelet Transform (CWT) to analyze the prosodic signal of input data, get the corresponding continuous prominence values of Chinese characters in the text to guide the training of a prominence prediction network, so that it can realize the mapping from the input text to the corresponding prominence value of each Chinese character in the text. The proposed method does not need to label the training data manually, so a fully automatic prosody control system is realized. Experiments show that the proposed system can generate more natural and expressive speech.\",\"PeriodicalId\":268378,\"journal\":{\"name\":\"2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC)\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CTISC52352.2021.00020\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CTISC52352.2021.00020","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

端到端文本到语音(TTS)可以直接从给定的字素或音素序列生成语音，显示出优于传统TTS的性能。它已经能够产生高质量的语音，但仍然无法控制局部韵律，如单词级别的强调。虽然合成语音的突出性可以通过明确的韵律标签来调整，但这种标签的获取往往是费时费力的。本文重点研究了一个深度神经突出预测模块，利用连续小波变换(CWT)对输入数据的周期信号进行分析，得到文本中汉字对应的连续突出值来指导突出预测网络的训练，从而实现从输入文本到文本中每个汉字对应的突出值的映射。该方法不需要对训练数据进行人工标注，实现了一个全自动韵律控制系统。实验表明，该系统能够生成更自然、更有表现力的语音。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Prosody-Enhanced Mandarin Text-to-Speech System

The end-to-end Text-to-Speech (TTS), which can generate speech directly from a given sequence of graphemes or phonemes, has shown superior performance over the conventional TTS. It has been able to generate high-quality speech, but it is still unable to control the local prosody such as word-level emphasis. Although the prominence of synthesized speech can be adjusted by explicit prosody tags, the acquisition of such tags is often time-consuming and laborious. This paper focuses on a deep neural prominence prediction module, using Continuous Wavelet Transform (CWT) to analyze the prosodic signal of input data, get the corresponding continuous prominence values of Chinese characters in the text to guide the training of a prominence prediction network, so that it can realize the mapping from the input text to the corresponding prominence value of each Chinese character in the text. The proposed method does not need to label the training data manually, so a fully automatic prosody control system is realized. Experiments show that the proposed system can generate more natural and expressive speech.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC)

自引率

0.00%

发文量