Speech Synthesis Workshop: Latest Publications

Archiving pushed Inferences from Sensor Data Streams
Speech Synthesis Workshop Pub Date: 2018-04-09 DOI: 10.5220/0003116000380046
J. Brunsmann
{"title":"Archiving pushed Inferences from Sensor Data Streams","authors":"J. Brunsmann","doi":"10.5220/0003116000380046","DOIUrl":"https://doi.org/10.5220/0003116000380046","url":null,"abstract":"Although pervasively deployed, sensors are currently neither highly interconnected nor very intelligent, since they do not know each other and produce only raw data streams. This lack of interoperability and high-level reasoning capabilities are major obstacles for exploiting the full potential of sensor data streams. Since interoperability and reasoning processes require a common understanding, RDF based linked sensor data is used in the semantic sensor web to articulate the meaning of sensor data. This paper shows how to derive higher levels of streamed sensor data understanding by constructing reasoning knowledge with SPARQL. In addition, it is demonstrated how to push these inferences to interested clients in different application domains like social media streaming, weather observation and intelligent product lifecycle maintenance. Finally, the paper describes how real-time pushing of inferences enables provenance tracking and how archiving of inferred events could support further decision making processes.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126596096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
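
As a rough illustration of the idea in the abstract above (RDF sensor data plus a SPARQL rule that derives a higher-level event), here is a minimal rdflib sketch. The ex: namespace, the single observation, and the 40-degree threshold rule are invented for this example and are not taken from the paper.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/sensor#")   # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# One (made-up) temperature observation from a sensor stream.
g.add((EX.obs1, RDF.type, EX.TemperatureObservation))
g.add((EX.obs1, EX.hasValue, Literal(41.5, datatype=XSD.double)))

# CONSTRUCT query that derives a higher-level event from raw readings:
# any observation above 40 degrees is inferred to trigger a heat alert.
query = """
PREFIX ex: <http://example.org/sensor#>
CONSTRUCT { ?obs ex:triggers ex:HeatAlert . }
WHERE     { ?obs ex:hasValue ?v . FILTER(?v > 40.0) }
"""

inferred = g.query(query).graph    # the constructed (inferred) triples
for triple in inferred:
    print(triple)                  # these are what would be pushed to clients
```
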
Parallel and cascaded deep neural networks for text-to-speech synthesis
Speech Synthesis Workshop Pub Date: 2016-09-15 DOI: 10.21437/SSW.2016-17
M. Ribeiro, O. Watts, J. Yamagishi
{"title":"Parallel and cascaded deep neural networks for text-to-speech synthesis","authors":"M. Ribeiro, O. Watts, J. Yamagishi","doi":"10.21437/SSW.2016-17","DOIUrl":"https://doi.org/10.21437/SSW.2016-17","url":null,"abstract":"An investigation of cascaded and parallel deep neural networks for speech synthesis is conducted. In these systems, suprasegmental linguistic features (syllable-level and above) are processed separately from segmental features (phone-level and below). The suprasegmental component of the networks learns compact distributed representations of high-level linguistic units without any segmental influence. These representations are then integrated into a frame-level system using a cascaded or a parallel approach. In the cascaded network, suprasegmental representations are used as input to the framelevel network. In the parallel network, segmental and suprasegmental features are processed separately and concatenated at a later stage. These experiments are conducted with a standard set of high-dimensional linguistic features as well as a hand-pruned one. It is observed that hierarchical systems are consistently preferred over the baseline feedforward systems. Similarly, parallel networks are preferred over cascaded networks.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121952802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
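
To make the cascaded/parallel distinction concrete, the NumPy sketch below wires up both topologies with random, untrained weights. The feature dimensions and layer sizes are made up and do not correspond to the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, sizes):
    """Tiny feed-forward net with random weights (illustration only)."""
    h = x
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.standard_normal((n_in, n_out)) * 0.1
        h = np.tanh(h @ W)
    return h

# Hypothetical feature dimensions.
segmental = rng.standard_normal((100, 40))    # frame-level (phone and below)
supraseg  = rng.standard_normal((100, 20))    # syllable level and above

# Cascaded: the suprasegmental net produces a compact representation that is
# appended to the segmental features before the frame-level network.
supra_embed  = mlp(supraseg, [20, 8])
cascaded_out = mlp(np.hstack([segmental, supra_embed]), [48, 64, 60])

# Parallel: each stream is processed separately and the hidden
# representations are concatenated at a later stage.
h_seg   = mlp(segmental, [40, 32])
h_supra = mlp(supraseg, [20, 32])
parallel_out = mlp(np.hstack([h_seg, h_supra]), [64, 60])

print(cascaded_out.shape, parallel_out.shape)   # (100, 60) (100, 60)
```
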
Merlin: An Open Source Neural Network Speech Synthesis System
Speech Synthesis Workshop Pub Date: 2016-09-15 DOI: 10.21437/SSW.2016-33
Zhizheng Wu, O. Watts, Simon King
{"title":"Merlin: An Open Source Neural Network Speech Synthesis System","authors":"Zhizheng Wu, O. Watts, Simon King","doi":"10.21437/SSW.2016-33","DOIUrl":"https://doi.org/10.21437/SSW.2016-33","url":null,"abstract":"We introduce the Merlin speech synthesis toolkit for neural network-based speech synthesis. The system takes linguistic features as input, and employs neural networks to predict acoustic features, which are then passed to a vocoder to produce the speech waveform. Various neural network architectures are implemented, including a standard feedforward neural network, mixture density neural network, recurrent neural network (RNN), long short-term memory (LSTM) recurrent neural network, amongst others. The toolkit is Open Source, written in Python, and is extensible. This paper briefly describes the system, and provides some benchmarking results on a freely-available corpus.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125094344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 320
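
The pipeline described above (linguistic features -> neural network -> acoustic features -> vocoder) can be sketched as follows. This is a generic stand-in with invented dimensions and placeholder functions, not Merlin's actual API.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions; a real system's configuration differs.
N_FRAMES, N_LING, N_ACOUSTIC = 200, 425, 187

def acoustic_model(linguistic, layer_sizes):
    """Stand-in for the neural acoustic model: linguistic -> acoustic features."""
    h = linguistic
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((n_in, n_out)) * 0.01
        h = np.tanh(h @ W)
    return h

def vocode(acoustic):
    """Placeholder for a vocoder call: acoustic features -> waveform samples."""
    return np.zeros(acoustic.shape[0] * 80)   # e.g. 5 ms frame shift at 16 kHz

linguistic = rng.standard_normal((N_FRAMES, N_LING))
acoustic   = acoustic_model(linguistic, [N_LING, 1024, 1024, N_ACOUSTIC])
waveform   = vocode(acoustic)
print(acoustic.shape, waveform.shape)
```
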
A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora
Speech Synthesis Workshop Pub Date: 2016-09-15 DOI: 10.21437/SSW.2016-20
Xin Wang, Shinji Takaki, J. Yamagishi
{"title":"A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora","authors":"Xin Wang, Shinji Takaki, J. Yamagishi","doi":"10.21437/SSW.2016-20","DOIUrl":"https://doi.org/10.21437/SSW.2016-20","url":null,"abstract":"This study investigates the impact of the amount of training data on the performance of parametric speech synthesis systems. A Japanese corpus with 100 hours’ audio recordings of a male voice and another corpus with 50 hours’ recordings of a female voice were utilized to train systems based on hidden Markov model (HMM), feed-forward neural network and recurrent neural network (RNN). The results show that the improvement on the accuracy of the predicted spectral features gradually diminishes as the amount of training data increases. However, different from the “diminishing returns” in the spectral stream, the accuracy of the predicted F0 trajectory by the HMM and RNN systems tends to consistently benefit from the increasing amount of training data.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134506279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
Wideband Harmonic Model: Alignment and Noise Modeling for High Quality Speech Synthesis
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-37
Slava Shechtman, A. Sorin
{"title":"Wideband Harmonic Model: Alignment and Noise Modeling for High Quality Speech Synthesis","authors":"Slava Shechtman, A. Sorin","doi":"10.21437/SSW.2016-37","DOIUrl":"https://doi.org/10.21437/SSW.2016-37","url":null,"abstract":"Speech sinusoidal modeling has been successfully applied to a broad range of speech analysis, synthesis and modification tasks. However, developing a high fidelity full band sinusoidal model that preserves its high quality on speech transformation still remains an open research problem. Such a system can be extremely useful for high quality speech synthesis. In this paper we present an enhanced harmonic model representation for voiced/mixed wide band speech that is capable of high quality speech reconstruction and transformation in the parametric domain. Two key elements of the proposed model are a proper phase alignment and a decomposition of a speech frame to \"deterministic\" and dense \"stochastic\" harmonic model representations that can be separately manipulated. The coupling of stochastic harmonic representation with the deterministic one is performed by means of intra-frame periodic energy envelope, estimated at analysis time and preserved during original/transformed speech reconstruction. In addition, we present a compact representation of the stochastic harmonic component, so that the proposed model has less parameters than the regular full band harmonic model, with better Signal to Reconstruction Error performance. On top of that, the improved phase alignment of the proposed model provides better phase coherency in transformed speech, resulting in better quality of speech transformations. We demonstrate the subjective and objective performance of the new model on speech reconstruction and pitch modification tasks. Performance of the proposed model within unit selection TTS is also presented.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122785255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
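
A toy NumPy sketch of the deterministic-plus-stochastic idea: a sum of harmonics of f0 plus noise shaped by an intra-frame periodic energy envelope. The sampling rate, f0, amplitudes and envelope shape are invented for illustration and do not follow the paper's analysis or estimation procedure.

```python
import numpy as np

fs, f0, frame_len = 16000, 120.0, 640          # hypothetical values
t = np.arange(frame_len) / fs
n_harm = int((fs / 2) // f0)                   # harmonics up to Nyquist

rng = np.random.default_rng(2)
amps   = 0.3 / np.arange(1, n_harm + 1)        # toy spectral tilt
phases = rng.uniform(-np.pi, np.pi, n_harm)

# "Deterministic" component: a sum of harmonics of f0.
deterministic = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t + p)
                    for k, (a, p) in enumerate(zip(amps, phases)))

# "Stochastic" component: noise modulated by an intra-frame periodic energy
# envelope (one period per pitch cycle), coupling it to the deterministic part.
envelope   = 0.5 * (1 + np.cos(2 * np.pi * f0 * t))
stochastic = 0.05 * rng.standard_normal(frame_len) * envelope

frame = deterministic + stochastic
print(frame.shape)
```
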
Synthesising Filled Pauses: Representation and Datamixing
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-2
R. Dall, M. Tomalin, M. Wester
{"title":"Synthesising Filled Pauses: Representation and Datamixing","authors":"R. Dall, M. Tomalin, M. Wester","doi":"10.21437/SSW.2016-2","DOIUrl":"https://doi.org/10.21437/SSW.2016-2","url":null,"abstract":"Filled pauses occur frequently in spontaneous human speech, yet modern text-to-speech synthesis systems rarely model these disfluencies overtly, and consequently they do not output convincing synthetic filled pauses. This paper presents a text-to-speech system that is specifically designed to model these particular disfluencies more efffectively. A preparatory investigation shows that a synthetic voice trained exclusively on spontaneous speech is perceived to be inferior in quality to a voice trained entirely on read speech, even though the latter does not handle filled pauses well. This motivates an investigation into the phonetic representation of filled pauses which show that, in a preference test, the use of a distinct phone for filled pauses is preferred over the standard /V/ phone and the alternative /@/ phone. In addition, we present a variety of data-mixing techniques to combine the strengths of standard synthesis systems trained on read speech corpora with the supplementary advantages offered by systems trained on spontaneous speech. In a MUSHRA-style test, it is found that the best overall quality is obtained by combining the two types of corpora using a source marking technique. Specifically, general speech is synthesised with a standard mark, while filled pauses are synthesised with a spontaneous mark, which has the added benefit of also producing filled pauses that are comparatively well synthesised.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126110168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
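
A toy sketch of the source-marking idea in the abstract above: read and spontaneous material carry different marks, and filled pauses get a distinct label instead of /V/ or /@/. The "fp" symbol, the filled-pause word list, and the word-level granularity are all made up for illustration.

```python
# Toy label preparation for data mixing with source marks.
READ, SPON = "read", "spontaneous"

def label_utterance(words, source):
    labels = []
    for w in words:
        if w in ("uh", "um"):               # filled pauses
            labels.append(("fp", SPON))     # distinct label + spontaneous mark
        else:
            labels.append((w, source))
    return labels

print(label_utterance(["the", "uh", "cat", "sat"], READ))
# [('the', 'read'), ('fp', 'spontaneous'), ('cat', 'read'), ('sat', 'read')]
```
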
How to select a good voice for TTS
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-15
Sunhee Kim
{"title":"How to select a good voice for TTS","authors":"Sunhee Kim","doi":"10.21437/SSW.2016-15","DOIUrl":"https://doi.org/10.21437/SSW.2016-15","url":null,"abstract":"Even though the perceived quality of a speaker’s natural voice does not necessarily guarantee the quality of synthesized speech, it is required to select a certain number of candidates based on their natural voice before moving to the evaluation stage of synthesized sentences. This paper describes a male speaker selection procedure for unit selection synthesis systems in English and Japanese based on perceptive evaluation and acoustic measurements of the speakers’ natural voice. A perceptive evaluation is performed on eight professional voice talents of each language. A total of twenty native-speaker listeners are recruited in both languages and each listener is asked to rate on eight analytical factors by using a five-scale score and rank three best speakers. Acoustic measurement focuses on the voice quality by extracting two measures from Long Term Average Spectrum (LTAS), the so-called Speakers Formant (SPF), which corresponds to the peak intensity between 3 kHz and 4 kHz, and the Alpha ratio (AR), which is the lower level difference between 0 and 1 kHz and 1 and 4 kHz ranges. The perceptive evaluation results show a very strong correlation between the total score and the preference in both languages, 0.9183 in English and 0.8589 in Japanese. The correlations between the perceptive evaluation and acoustic measurements are moderate with respect to SPF and AR, 0.473 and -0.494 in English, and 0.288 and -0.263 in Japanese.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130257527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
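
The two LTAS-based measures can be approximated as in the NumPy sketch below: an average magnitude spectrum over frames, the peak level in the 3-4 kHz band for SPF, and a band-level difference for the Alpha ratio. The frame length, windowing, test signal and exact band handling are assumptions for illustration, not the paper's analysis settings.

```python
import numpy as np

def ltas_db(x, fs, nfft=1024):
    """Long-Term Average Spectrum: frame-averaged magnitude spectrum, in dB."""
    frames = x[: len(x) // nfft * nfft].reshape(-1, nfft)
    spec = np.abs(np.fft.rfft(frames * np.hanning(nfft), axis=1)).mean(axis=0)
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    return freqs, 20 * np.log10(spec + 1e-12)

def band_level(freqs, ltas, lo, hi):
    return ltas[(freqs >= lo) & (freqs < hi)].mean()

fs = 16000
x = np.random.default_rng(3).standard_normal(fs * 5)   # stand-in for a recording

freqs, ltas = ltas_db(x, fs)
spf   = ltas[(freqs >= 3000) & (freqs < 4000)].max()   # "Speakers Formant" peak
alpha = band_level(freqs, ltas, 1000, 4000) - band_level(freqs, ltas, 0, 1000)
print(round(spf, 1), round(alpha, 1))
```
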
Prediction of Emotions from Text using Sentiment Analysis for Expressive Speech Synthesis
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-4
Eva Vanmassenhove, João P. Cabral, F. Haider
{"title":"Prediction of Emotions from Text using Sentiment Analysis for Expressive Speech Synthesis","authors":"Eva Vanmassenhove, João P. Cabral, F. Haider","doi":"10.21437/SSW.2016-4","DOIUrl":"https://doi.org/10.21437/SSW.2016-4","url":null,"abstract":"The generation of expressive speech is a great challenge for text-to-speech synthesis in audiobooks. One of the most important factors is the variation in speech emotion or voice style. In this work, we developed a method to predict the emotion from a sentence so that we can convey it through the synthetic voice. It consists of combining a standard emotion-lexicon based technique with the polarity-scores (positive/negative polarity) provided by a less fine-grained sentiment analysis tool, in order to get more accurate emotion-labels. The primary goal of this emotion prediction tool was to select the type of voice (one of the emotions or neutral) given the input sentence to a state-of-the-art HMM-based Text-to-Speech (TTS) system. In addition, we also combined the emotion prediction from text with a speech clustering method to select the utterances with emotion during the process of building the emotional corpus for the speech synthesizer. Speech clustering is a popular approach to divide the speech data into subsets associated with different voice styles. The challenge here is to determine the clusters that map out the basic emotions from an audiobook corpus that contains high variety of speaking styles, in a way that minimizes the need for human annotation. The evaluation of emotion classification from text showed that, in general, our system can obtain accuracy results close to that of human annotators. Results also indicate that this technique is useful in the selection of utterances with emotion for building expressive synthetic voices.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117295652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
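
A minimal sketch of combining an emotion lexicon with a coarse polarity score to pick a voice style, in the spirit of the abstract above. The lexicon entries, emotion categories, and the filtering rule are invented, not those used in the paper.

```python
# Hypothetical emotion lexicon and polarity-consistency check.
EMOTION_LEXICON = {
    "wonderful": "joy", "terrified": "fear", "furious": "anger", "weeping": "sadness",
}
POSITIVE_EMOTIONS = {"joy"}

def predict_voice(sentence, polarity):
    """polarity: sentiment score in [-1, 1] from some external analyser."""
    hits = [EMOTION_LEXICON[w] for w in sentence.lower().split() if w in EMOTION_LEXICON]
    if not hits:
        return "neutral"
    emotion = hits[0]
    # Use the polarity score to discard lexicon hits that contradict it.
    if polarity > 0 and emotion not in POSITIVE_EMOTIONS:
        return "neutral"
    if polarity < 0 and emotion in POSITIVE_EMOTIONS:
        return "neutral"
    return emotion

print(predict_voice("She was terrified of the dark", -0.6))   # fear
print(predict_voice("What a wonderful day", 0.8))              # joy
```
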
Experiments with Cross-lingual Systems for Synthesis of Code-Mixed Text
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-13
Sunayana Sitaram, Sai Krishna Rallabandi, Shruti Rijhwani, A. Black
{"title":"Experiments with Cross-lingual Systems for Synthesis of Code-Mixed Text","authors":"Sunayana Sitaram, Sai Krishna Rallabandi, Shruti Rijhwani, A. Black","doi":"10.21437/SSW.2016-13","DOIUrl":"https://doi.org/10.21437/SSW.2016-13","url":null,"abstract":"Most Text to Speech (TTS) systems today assume that the input is in a single language written in its native script, which is the language that the TTS database is recorded in. However, due to the rise in conversational data available from social media, phenomena such as code-mixing, in which multiple languages are used together in the same conversation or sentence are now seen in text. TTS systems capable of synthesizing such text need to be able to handle multiple languages at the same time, and may also need to deal with noisy input. Previously, we proposed a framework to synthesize code-mixed text by using a TTS database in a single language, identifying the language that each word was from, normalizing spellings of a language written in a non-standardized script and mapping the phonetic space of mixed language to the language that the TTS database was recorded in. We extend this cross-lingual approach to more language pairs, and improve upon our language identification technique. We conduct listening tests to determine which of the two languages being mixed should be used as the target language. We perform experiments for code-mixed Hindi-English and German-English and conduct listening tests with bilingual speakers of these languages. From our subjective experiments we find that listeners have a strong preference for cross-lingual systems with Hindi as the target language for code-mixed Hindi and English text. We also find that listeners prefer cross-lingual systems in English that can synthesize German text for codemixed German and English text.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131490180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 32
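
A toy sketch of two steps from the cross-lingual pipeline described above: per-word language identification and mapping "foreign" phones onto the phone set of the target voice. The word list stands in for a real language ID model and the phone mapping is entirely made up.

```python
HINDI_WORDS = {"mera", "ghar", "bahut"}   # hypothetical stand-in for a language ID model

def identify_language(word):
    return "hi" if word.lower() in HINDI_WORDS else "en"

# Hypothetical mapping from English phone labels to the closest phones
# available in a Hindi TTS voice.
EN_TO_HI_PHONES = {"th": "t", "ae": "e", "w": "v"}

def map_phones(phones, source_lang, target_lang="hi"):
    if source_lang == target_lang:
        return phones
    return [EN_TO_HI_PHONES.get(p, p) for p in phones]

sentence = "mera ghar is very beautiful".split()
print([(w, identify_language(w)) for w in sentence])
print(map_phones(["th", "ih", "s"], "en"))   # ['t', 'ih', 's']
```
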
Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-28
Sivanand Achanta, Rambabu Banoth, Ayushi Pandey, Anandaswarup Vadapalli, S. Gangashetty
{"title":"Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis","authors":"Sivanand Achanta, Rambabu Banoth, Ayushi Pandey, Anandaswarup Vadapalli, S. Gangashetty","doi":"10.21437/SSW.2016-28","DOIUrl":"https://doi.org/10.21437/SSW.2016-28","url":null,"abstract":"In this paper, we propose to use hidden state vector ob-tained from recurrent neural network (RNN) as a context vector representation for deep neural network (DNN) based statistical parametric speech synthesis. While in a typical DNN based system, there is a hierarchy of text features from phone level to utterance level, they are usually in 1-hot-k encoded representation. Our hypothesis is that, supplementing the conventional text features with a continuous frame-level acoustically guided representation would improve the acoustic modeling. The hidden state from an RNN trained to predict acoustic features is used as the additional contextual information. A dataset consisting of 2 Indian languages (Telugu and Hindi) from Blizzard challenge 2015 was used in our experiments. Both the subjective listening tests and the objective scores indicate that the proposed approach per-forms significantly better than the baseline DNN system.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129464246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
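
A NumPy sketch of using an RNN's hidden state as an additional frame-level context vector. In the paper the RNN is trained to predict acoustic features; here the weights are random and the dimensions invented, purely to show how the hidden states would be concatenated with the conventional text features before the DNN.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical sizes; the paper's actual feature dimensions differ.
N_FRAMES, N_TEXT, N_HIDDEN = 300, 350, 64

W_xh = rng.standard_normal((N_TEXT, N_HIDDEN)) * 0.05
W_hh = rng.standard_normal((N_HIDDEN, N_HIDDEN)) * 0.05

def rnn_hidden_states(inputs):
    """Forward pass of a plain RNN; returns the hidden state at every frame."""
    h = np.zeros(N_HIDDEN)
    states = []
    for x in inputs:
        h = np.tanh(x @ W_xh + h @ W_hh)
        states.append(h)
    return np.array(states)

text_features = rng.standard_normal((N_FRAMES, N_TEXT))   # conventional text features
context = rnn_hidden_states(text_features)                 # continuous context vectors

# The DNN acoustic model then sees the conventional features plus the
# RNN hidden state as additional frame-level context.
dnn_input = np.hstack([text_features, context])
print(dnn_input.shape)   # (300, 414)
```
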