Speech Synthesis Workshop最新文献

Archiving pushed Inferences from Sensor Data Streams 归档从传感器数据流推送推断

Speech Synthesis Workshop Pub Date : 2018-04-09 DOI: 10.5220/0003116000380046

J. Brunsmann

引用次数: 5

Parallel and cascaded deep neural networks for text-to-speech synthesis 用于文本到语音合成的并行和级联深度神经网络

Speech Synthesis Workshop Pub Date : 2016-09-15 DOI: 10.21437/SSW.2016-17

M. Ribeiro, O. Watts, J. Yamagishi

引用次数: 7

Merlin: An Open Source Neural Network Speech Synthesis System 一个开源的神经网络语音合成系统

Speech Synthesis Workshop Pub Date : 2016-09-15 DOI: 10.21437/SSW.2016-33

Zhizheng Wu, O. Watts, Simon King

引用次数: 320

A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora 基于HMM、DNN和RNN的超大型说话人相关语料库语音合成系统性能比较研究

Speech Synthesis Workshop Pub Date : 2016-09-15 DOI: 10.21437/SSW.2016-20

Xin Wang, Shinji Takaki, J. Yamagishi

引用次数: 14

Wideband Harmonic Model: Alignment and Noise Modeling for High Quality Speech Synthesis 宽带谐波模型:高质量语音合成的对准和噪声建模

Speech Synthesis Workshop Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-37

Slava Shechtman, A. Sorin

{"title":"Wideband Harmonic Model: Alignment and Noise Modeling for High Quality Speech Synthesis","authors":"Slava Shechtman, A. Sorin","doi":"10.21437/SSW.2016-37","DOIUrl":"https://doi.org/10.21437/SSW.2016-37","url":null,"abstract":"Speech sinusoidal modeling has been successfully applied to a broad range of speech analysis, synthesis and modification tasks. However, developing a high fidelity full band sinusoidal model that preserves its high quality on speech transformation still remains an open research problem. Such a system can be extremely useful for high quality speech synthesis. In this paper we present an enhanced harmonic model representation for voiced/mixed wide band speech that is capable of high quality speech reconstruction and transformation in the parametric domain. Two key elements of the proposed model are a proper phase alignment and a decomposition of a speech frame to \"deterministic\" and dense \"stochastic\" harmonic model representations that can be separately manipulated. The coupling of stochastic harmonic representation with the deterministic one is performed by means of intra-frame periodic energy envelope, estimated at analysis time and preserved during original/transformed speech reconstruction. In addition, we present a compact representation of the stochastic harmonic component, so that the proposed model has less parameters than the regular full band harmonic model, with better Signal to Reconstruction Error performance. On top of that, the improved phase alignment of the proposed model provides better phase coherency in transformed speech, resulting in better quality of speech transformations. We demonstrate the subjective and objective performance of the new model on speech reconstruction and pitch modification tasks. Performance of the proposed model within unit selection TTS is also presented.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122785255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Synthesising Filled Pauses: Representation and Datamixing 合成填充停顿:表示和数据混合

Speech Synthesis Workshop Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-2

R. Dall, M. Tomalin, M. Wester

{"title":"Synthesising Filled Pauses: Representation and Datamixing","authors":"R. Dall, M. Tomalin, M. Wester","doi":"10.21437/SSW.2016-2","DOIUrl":"https://doi.org/10.21437/SSW.2016-2","url":null,"abstract":"Filled pauses occur frequently in spontaneous human speech, yet modern text-to-speech synthesis systems rarely model these disﬂuencies overtly, and consequently they do not output convincing synthetic ﬁlled pauses. This paper presents a text-to-speech system that is speciﬁcally designed to model these particular disﬂuencies more efffectively. A preparatory investigation shows that a synthetic voice trained exclusively on spontaneous speech is perceived to be inferior in quality to a voice trained entirely on read speech, even though the latter does not handle ﬁlled pauses well. This motivates an investigation into the phonetic representation of ﬁlled pauses which show that, in a preference test, the use of a distinct phone for ﬁlled pauses is preferred over the standard /V/ phone and the alternative /@/ phone. In addition, we present a variety of data-mixing techniques to combine the strengths of standard synthesis systems trained on read speech corpora with the supplementary advantages offered by systems trained on spontaneous speech. In a MUSHRA-style test, it is found that the best overall quality is obtained by combining the two types of corpora using a source marking technique. Speciﬁcally, general speech is synthesised with a standard mark, while ﬁlled pauses are synthesised with a spontaneous mark, which has the added beneﬁt of also producing ﬁlled pauses that are comparatively well synthesised.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126110168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 13

How to select a good voice for TTS 如何为TTS选择一个好的声音

Speech Synthesis Workshop Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-15

Sunhee Kim

{"title":"How to select a good voice for TTS","authors":"Sunhee Kim","doi":"10.21437/SSW.2016-15","DOIUrl":"https://doi.org/10.21437/SSW.2016-15","url":null,"abstract":"Even though the perceived quality of a speaker’s natural voice does not necessarily guarantee the quality of synthesized speech, it is required to select a certain number of candidates based on their natural voice before moving to the evaluation stage of synthesized sentences. This paper describes a male speaker selection procedure for unit selection synthesis systems in English and Japanese based on perceptive evaluation and acoustic measurements of the speakers’ natural voice. A perceptive evaluation is performed on eight professional voice talents of each language. A total of twenty native-speaker listeners are recruited in both languages and each listener is asked to rate on eight analytical factors by using a five-scale score and rank three best speakers. Acoustic measurement focuses on the voice quality by extracting two measures from Long Term Average Spectrum (LTAS), the so-called Speakers Formant (SPF), which corresponds to the peak intensity between 3 kHz and 4 kHz, and the Alpha ratio (AR), which is the lower level difference between 0 and 1 kHz and 1 and 4 kHz ranges. The perceptive evaluation results show a very strong correlation between the total score and the preference in both languages, 0.9183 in English and 0.8589 in Japanese. The correlations between the perceptive evaluation and acoustic measurements are moderate with respect to SPF and AR, 0.473 and -0.494 in English, and 0.288 and -0.263 in Japanese.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130257527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Prediction of Emotions from Text using Sentiment Analysis for Expressive Speech Synthesis 基于情感分析的文本情感预测及其表达性语音合成

Speech Synthesis Workshop Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-4

Eva Vanmassenhove, João P. Cabral, F. Haider

{"title":"Prediction of Emotions from Text using Sentiment Analysis for Expressive Speech Synthesis","authors":"Eva Vanmassenhove, João P. Cabral, F. Haider","doi":"10.21437/SSW.2016-4","DOIUrl":"https://doi.org/10.21437/SSW.2016-4","url":null,"abstract":"The generation of expressive speech is a great challenge for text-to-speech synthesis in audiobooks. One of the most important factors is the variation in speech emotion or voice style. In this work, we developed a method to predict the emotion from a sentence so that we can convey it through the synthetic voice. It consists of combining a standard emotion-lexicon based technique with the polarity-scores (positive/negative polarity) provided by a less ﬁne-grained sentiment analysis tool, in order to get more accurate emotion-labels. The primary goal of this emotion prediction tool was to select the type of voice (one of the emotions or neutral) given the input sentence to a state-of-the-art HMM-based Text-to-Speech (TTS) system. In addition, we also combined the emotion prediction from text with a speech clustering method to select the utterances with emotion during the process of building the emotional corpus for the speech synthesizer. Speech clustering is a popular approach to divide the speech data into subsets associated with different voice styles. The challenge here is to determine the clusters that map out the basic emotions from an audiobook corpus that contains high variety of speaking styles, in a way that minimizes the need for human annotation. The evaluation of emotion classiﬁcation from text showed that, in general, our system can obtain accuracy results close to that of human annotators. Results also indicate that this technique is useful in the selection of utterances with emotion for building expressive synthetic voices.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117295652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

Experiments with Cross-lingual Systems for Synthesis of Code-Mixed Text 代码混合文本合成的跨语言系统实验

Speech Synthesis Workshop Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-13

Sunayana Sitaram, Sai Krishna Rallabandi, Shruti Rijhwani, A. Black

{"title":"Experiments with Cross-lingual Systems for Synthesis of Code-Mixed Text","authors":"Sunayana Sitaram, Sai Krishna Rallabandi, Shruti Rijhwani, A. Black","doi":"10.21437/SSW.2016-13","DOIUrl":"https://doi.org/10.21437/SSW.2016-13","url":null,"abstract":"Most Text to Speech (TTS) systems today assume that the input is in a single language written in its native script, which is the language that the TTS database is recorded in. However, due to the rise in conversational data available from social media, phenomena such as code-mixing, in which multiple languages are used together in the same conversation or sentence are now seen in text. TTS systems capable of synthesizing such text need to be able to handle multiple languages at the same time, and may also need to deal with noisy input. Previously, we proposed a framework to synthesize code-mixed text by using a TTS database in a single language, identifying the language that each word was from, normalizing spellings of a language written in a non-standardized script and mapping the phonetic space of mixed language to the language that the TTS database was recorded in. We extend this cross-lingual approach to more language pairs, and improve upon our language identification technique. We conduct listening tests to determine which of the two languages being mixed should be used as the target language. We perform experiments for code-mixed Hindi-English and German-English and conduct listening tests with bilingual speakers of these languages. From our subjective experiments we find that listeners have a strong preference for cross-lingual systems with Hindi as the target language for code-mixed Hindi and English text. We also find that listeners prefer cross-lingual systems in English that can synthesize German text for codemixed German and English text.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131490180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 32

Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis 基于深度神经网络的语音合成中不同成分的说话人自适应

Speech Synthesis Workshop Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-25

Shinji Takaki, Sangjin Kim, J. Yamagishi

引用次数: 8