A Novel Unit Selection and Unit Smoothing Method for Chinese Concatenation Speech

Proceedings of the 2019 International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019)
Pub Date: 2019-07-01
DOI: 10.2991/MASTA-19.2019.47

Xiaoxia Yang, Zhi-cheng Liu, Qilong Sun, Hao Wang

Abstract

This paper introduces a new approach to unit selection and unit concatenation in which the Chinese character is the smallest unit in the speech corpus and, at the concatenation stage, speech segments are matched not only in phase but also in amplitude. A conventional hybrid system is used. First, LSTMs are adopted for the acoustic model and the duration model, and prosody is predicted by Conditional Random Fields (CRFs). Second, without considering a continuously-valued cost, we use Dynamic Time Warping (DTW) directly to select units based on acoustic features such as mel-cepstrum and fundamental frequency (F0). Finally, an improved cross-fade method that takes amplitude into account is adopted in waveform concatenation to improve smoothing, and very natural speech is synthesized.

Introduction

The recent rise of deep neural networks (DNNs) has brought an increase in performance in both automatic speech recognition (ASR) and statistical text-to-speech (TTS) technology [1]. Parametric synthesis and unit concatenation synthesis, the two mainstream approaches to speech synthesis, have also developed considerably because of DNNs. Unlike the parametric synthesis method, which uses a vocoder such as WORLD to synthesize speech, the concatenation synthesis method selects proper units from multiple instances to achieve better flexibility and quality in prosody and timbre [2]. The tonal syllable is considered the basic unit in synthesis systems for Mandarin, because co-articulation between phonemes within the same syllable is very strong and there are far fewer concatenation points between syllables [2].

Some researchers find that, among the junctures between phonemes, between syllables, between rhythm units, and between stops in Chinese, those between phonemes are the strongest, those between syllables are the next strongest, and the others are weaker [3]. [4] trains a group of syllable classifiers that assign a non-continuously-valued cost to each candidate unit. [5,6] also suggest that eliminating unacceptable units is crucial to synthesizing natural speech. LSTM-guided unit selection synthesis systems have achieved state-of-the-art performance in statistical parametric speech synthesis (SPSS) systems owing to their deep architecture and their capacity to model long-term dependencies across linguistic features, which HMMs do not possess [7]. [8] confirms the objective result that LSTMs model acoustic features better than DNNs.

LSTM is therefore adopted for acoustic modeling and duration modeling. The current mainstream hybrid system is applied to synthesize speech: CART decision trees take part in unit pre-selection, and DTW is used to select the optimal unit. Finally, a new fade-in/fade-out method that takes amplitude into account is adopted in waveform concatenation to improve smoothing.

The paper is organized as follows. Section 2 discusses the preprocessing techniques for the speech corpus used in our method. Section 3 introduces the LSTM- and CRF-based parametric synthesis system and describes an improved algorithm for unit concatenation smoothing. Objective tests and evaluation are presented in Section 4. Section 5 concludes.

International Conference on Modeling, Analysis, Simulation Technologies and Applications (MASTA 2019). Copyright © 2019, the Authors. Published by Atlantis Press. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/). Advances in Intelligent Systems Research, volume 168.
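The DTW-based unit selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each candidate unit and the model-predicted target are arrays of per-frame acoustic features (for example mel-cepstrum coefficients with F0 appended) and that the candidate with the lowest DTW cost is chosen. The function names `dtw_distance` and `select_unit`, the Euclidean local distance, and the step pattern are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Accumulated DTW cost between two feature sequences of shape (frames, dims)."""
    n, m = len(a), len(b)
    # Euclidean distance between every frame of `a` and every frame of `b`.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # acc[i, j] = cheapest cost of aligning the first i frames of `a`
    # with the first j frames of `b`.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev
    return float(acc[n, m])


def select_unit(target: np.ndarray, candidates: list) -> int:
    """Return the index of the candidate whose features are closest to the target."""
    costs = [dtw_distance(target, c) for c in candidates]
    return int(np.argmin(costs))


# Toy one-dimensional "features": the second candidate is a time-stretched
# copy of the target, so it aligns with zero cost and is selected.
target = np.array([[0.0], [1.0], [2.0]])
candidates = [
    np.array([[5.0], [5.0]]),
    np.array([[0.0], [0.0], [1.0], [2.0]]),
]
best = select_unit(target, candidates)
```

Because DTW tolerates differences in duration, candidates do not need the same number of frames as the target. In practice the feature dimensions would likely need normalizing so that F0 and cepstral coefficients contribute on comparable scales, which this sketch does not attempt.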
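The idea of a cross-fade that takes amplitude into account can be sketched as follows. This text does not give the authors' exact algorithm, so the approach below is an assumption: the right-hand unit is rescaled so that its RMS level in the overlap region matches that of the left-hand unit, and the two are then blended with a linear fade-out/fade-in. The names `crossfade_join`, `overlap`, and the gain limits are hypothetical.

```python
import numpy as np


def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a waveform segment."""
    return float(np.sqrt(np.mean(np.square(x))))


def crossfade_join(left: np.ndarray, right: np.ndarray, overlap: int,
                   min_gain: float = 0.25, max_gain: float = 4.0) -> np.ndarray:
    """Join two waveforms with an amplitude-matched linear cross-fade.

    `overlap` is the number of samples shared by the end of `left`
    and the start of `right`.
    """
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be in 1..min(len(left), len(right))")

    tail = left[-overlap:]
    head = right[:overlap]

    # Assumed amplitude step: match the RMS level of the incoming unit to the
    # outgoing one in the overlap, with the gain clamped so that a
    # near-silent head does not cause extreme amplification.
    gain = rms(tail) / max(rms(head), 1e-8)
    gain = float(np.clip(gain, min_gain, max_gain))
    scaled_right = right * gain

    # Linear fade-out of the left unit and complementary fade-in of the right.
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    blended = tail * fade_out + scaled_right[:overlap] * fade_in

    return np.concatenate([left[:-overlap], blended, scaled_right[overlap:]])


# Two constant signals at different levels join without a level jump.
left = np.ones(100)
right = 0.5 * np.ones(100)
joined = crossfade_join(left, right, overlap=20)
```

Scaling the whole right-hand unit is the simplest choice. A gain that ramps back to 1 over the unit would preserve its original loudness, and that variant is not shown here.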
