Prosody-Enhanced Mandarin Text-to-Speech System
Fangfang Niu, Wushour Silamu
2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC), published 2021-04-01
DOI: 10.1109/CTISC52352.2021.00020 (https://doi.org/10.1109/CTISC52352.2021.00020)

End-to-end text-to-speech (TTS), which generates speech directly from a sequence of graphemes or phonemes, has shown superior performance over conventional TTS. It can generate high-quality speech, but it still cannot control local prosody such as word-level emphasis. Although the prominence of synthesized speech can be adjusted with explicit prosody tags, acquiring such tags is often time-consuming and laborious. This paper focuses on a deep neural prominence prediction module. The Continuous Wavelet Transform (CWT) is used to analyze the prosodic signal of the input data and obtain a continuous prominence value for each Chinese character in the text. These values guide the training of a prominence prediction network, which learns to map the input text to the prominence value of each of its Chinese characters. Because the proposed method does not require the training data to be labeled manually, it yields a fully automatic prosody control system. Experiments show that the proposed system generates more natural and expressive speech.
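
The abstract does not say which prosodic signal, wavelet, scales, or aggregation rule the authors use, so the following is only a minimal sketch of how CWT-based prominence values per character could be derived. It assumes a single frame-level prosodic contour (for example interpolated F0 or energy), a Mexican-hat wavelet, a hand-picked set of scales, and character boundaries given as frame index pairs. The function names `ricker`, `cwt`, and `char_prominence` are hypothetical and are not taken from the paper.

```python
import numpy as np


def ricker(points, a):
    """Mexican-hat (Ricker) wavelet of width `a`, sampled at `points` samples."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))


def cwt(signal, scales):
    """Continuous wavelet transform by direct convolution, one row per scale."""
    out = np.empty((len(scales), len(signal)))
    for i, a in enumerate(scales):
        # Truncate the wavelet support so it never exceeds the signal length.
        n = min(10 * int(a), len(signal))
        out[i] = np.convolve(signal, ricker(n, a), mode="same")
    return out


def char_prominence(signal, boundaries, scales):
    """Assumed rule: prominence = peak of the scale-summed CWT in each span."""
    # Normalize the contour so values are comparable across utterances.
    z = (signal - signal.mean()) / (signal.std() + 1e-8)
    summed = cwt(z, scales).sum(axis=0)
    return np.array([summed[start:end].max() for start, end in boundaries])


if __name__ == "__main__":
    # Synthetic contour: three characters of 100 frames each,
    # with an emphasis-like bump in the middle one.
    frames = np.arange(300)
    contour = np.exp(-((frames - 150) ** 2) / 100.0)
    spans = [(0, 100), (100, 200), (200, 300)]
    print(char_prominence(contour, spans, scales=[4, 8, 16]))
```

In a pipeline like the one described, values of this kind would serve as regression targets for the prominence prediction network, which is what removes the need for manual prosody tags. The network itself and the way its output conditions the synthesizer are not specified in the abstract and are left out here.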