A Novel Unit Selection and Unit Smoothing Method for Chinese Concatenation Speech
Authors: Xiaoxia Yang, Zhi-cheng Liu, Qilong Sun, Hao Wang
DOI: 10.2991/MASTA-19.2019.47 (https://doi.org/10.2991/MASTA-19.2019.47)
Published: 2019-07-01, in Proceedings of the 2019 International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)
Citations: 0

This paper introduces a new approach to unit selection and unit concatenation in which the Chinese character is the smallest unit in the speech corpus and, at the concatenation stage, speech segments are matched not only in phase but also in amplitude. A conventional hybrid system is used. First, LSTMs are adopted for the acoustic model and the duration model, and prosody is predicted by Conditional Random Fields (CRFs). Second, without considering a continuously-valued cost, Dynamic Time Warping (DTW) is used directly to select units based on acoustic features such as the mel-cepstrum and fundamental frequency (F0). Finally, an improved cross-fade method that takes amplitude into account is adopted in waveform concatenation to improve smoothing, and very natural speech is synthesized.

Introduction

The recent rise of deep neural networks (DNNs) has improved performance in both automatic speech recognition (ASR) and statistical text-to-speech (TTS) technology [1]. Parametric synthesis and unit concatenation synthesis, the two mainstream speech synthesis approaches, have also advanced considerably because of DNNs. Unlike parametric synthesis, which uses a vocoder such as WORLD to generate speech, concatenative synthesis selects suitable units from multiple instances to achieve better flexibility and quality in prosody and timbre [2].

The tonal syllable is considered the basic unit in synthesis systems for Mandarin, because co-articulation between phonemes within the same syllable is very strong and far fewer concatenation points arise between syllables [2]. Some researchers find that, among the junctures between phonemes, between syllables, between rhythm units, and between stops in Chinese, those between phonemes are the strongest, those between syllables are the next strongest, and the others are weaker; a group of syllable classifiers has also been trained to assign a non-continuously-valued cost to each candidate unit [3,4].
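
The DTW-based selection named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the Euclidean frame distance, the unnormalized accumulated cost, and the names `frame_distance`, `dtw_distance`, and `select_unit` are all assumptions made for illustration. Each frame stands in for a feature vector such as mel-cepstrum coefficients plus F0, the target sequence stands in for the model's predicted trajectory, and the candidate with the smallest DTW cost is chosen.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature frames of equal length (assumed metric)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(a[i - 1], b[j - 1])
            # Allowed steps: advance in a, advance in b, or advance in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def select_unit(target, candidates):
    """Return the index of the candidate closest to the target under DTW."""
    distances = [dtw_distance(target, c) for c in candidates]
    return min(range(len(candidates)), key=distances.__getitem__)


# Toy one-dimensional "features": the first candidate is a time-stretched
# copy of the target, so it aligns at zero cost and is selected.
target = [[0.0], [1.0], [2.0]]
candidates = [[[0.0], [0.0], [1.0], [2.0]], [[5.0], [5.0]]]
best = select_unit(target, candidates)
```

Because DTW tolerates differences in duration, a candidate that is a stretched version of the target still scores well, which is one reason it suits comparing units of unequal length.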
[5,6] also suggest that eliminating unacceptable units is crucial to synthesizing natural speech. LSTM-guided unit selection synthesis systems have achieved state-of-the-art performance in statistical parametric speech synthesis (SPSS) thanks to their deep architecture and their capacity to model long-term dependencies across linguistic features, which HMMs lack; objective results confirm that LSTMs model acoustic features better than DNNs [7,8]. LSTMs are therefore adopted for acoustic modeling and duration modeling. The current mainstream hybrid system is applied to synthesize speech: CART decision trees take part in unit pre-selection, and DTW is used to select the optimal unit. Finally, a new fade-in/fade-out method that takes amplitude into account is adopted in waveform concatenation to improve smoothing.

The paper is organized as follows. Section 2 discusses the preprocessing techniques for the speech corpus used in our method. Section 3 introduces the LSTM- and CRF-based parametric synthesis system and describes an improved algorithm for unit concatenation smoothing. Objective tests and evaluation are presented in Section 4. Section 5 concludes the paper.

International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)
Copyright © 2019, the Authors. Published by Atlantis Press. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Advances in Intelligent Systems Research, volume 168
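
The amplitude-aware fade-in/fade-out step is only named in this excerpt, not specified, so the sketch below shows one plausible reading rather than the paper's algorithm. It assumes that amplitude is measured as RMS over the overlap region, that the incoming unit is scaled by a gain ramping from the outgoing unit's level back to its own level, and that the fade itself is linear. The function names `rms` and `crossfade_amplitude_matched` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def crossfade_amplitude_matched(left, right, overlap):
    """Join two waveforms with a linear cross-fade after matching overlap levels.

    Assumed scheme: the head of `right` is scaled by a gain that starts at
    rms(tail of left) / rms(head of right) and ramps to 1 across the overlap.
    """
    if not 0 < overlap <= min(len(left), len(right)):
        raise ValueError("overlap must be positive and fit inside both units")
    tail, head = left[-overlap:], right[:overlap]
    head_level = rms(head)
    # Avoid dividing by zero when the incoming unit starts with silence.
    start_gain = rms(tail) / head_level if head_level > 0.0 else 1.0

    mixed = []
    for k in range(overlap):
        t = (k + 1) / (overlap + 1)                # fade position, strictly in (0, 1)
        gain = start_gain + (1.0 - start_gain) * t  # ramps from start_gain toward 1
        mixed.append((1.0 - t) * tail[k] + t * gain * head[k])
    return left[:-overlap] + mixed + right[overlap:]


# Toy example: a louder unit followed by a quieter one.
left = [1.0] * 8
right = [0.5] * 8
joined = crossfade_amplitude_matched(left, right, overlap=4)
```

With a plain linear cross-fade the first overlapped sample here would be 0.9; with the gain ramp it is 0.98, so the level falls more gradually across the joint instead of dropping as soon as the quieter unit enters.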