Speech Synthesis Workshop: Latest Publications

Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-25
Shinji Takaki, Sangjin Kim, J. Yamagishi
{"title":"Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis","authors":"Shinji Takaki, Sangjin Kim, J. Yamagishi","doi":"10.21437/SSW.2016-25","DOIUrl":"https://doi.org/10.21437/SSW.2016-25","url":null,"abstract":"In this paper, we investigate the effectiveness of speaker adaptation for various essential components in deep neural network based speech synthesis, including acoustic models, acoustic feature extraction, and post-filters. In general, a speaker adaptation technique, e.g., maximum likelihood linear regression (MLLR) for HMMs or learning hidden unit contributions (LHUC) for DNNs, is applied to an acoustic modeling part to change voice characteristics or speaking styles. However, since we have proposed a multiple DNN-based speech synthesis system, in which several components are represented based on feed-forward DNNs, a speaker adaptation technique can be applied not only to the acoustic modeling part but also to other components represented by DNNs. In experiments using a small amount of adaptation data, we performed adaptation based on LHUC and simple additional fine tuning for DNN-based acoustic models, deep auto-encoder based feature extraction, and DNN-based post-filter models and compared them with HMM-based speech synthesis systems using MLLR.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133798947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
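The LHUC adaptation mentioned above re-weights each hidden unit of a trained network with a speaker-dependent amplitude learned from a small amount of target-speaker data, while the speaker-independent weights stay frozen. A minimal numpy sketch of that re-weighting, assuming the usual a = 2*sigmoid(alpha) parameterization (the paper's actual layer sizes and training loop are not shown):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_scale(h, alpha):
    # Re-weight each hidden unit by a speaker-dependent amplitude
    # a = 2 * sigmoid(alpha) in (0, 2). During adaptation only alpha
    # is updated on the target speaker's data; the weights that
    # produced the activations h stay frozen.
    return h * 2.0 * sigmoid(alpha)

h = np.array([0.5, -0.2, 1.3, 0.0])  # activations of one hidden layer
alpha = np.zeros_like(h)             # alpha = 0 gives amplitudes of 1.0
assert np.allclose(lhuc_scale(h, alpha), h)  # unadapted net is unchanged
```

Because alpha = 0 reproduces the speaker-independent network exactly, adaptation can start from the average-voice model and move only as far as the adaptation data supports.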
WikiSpeech - enabling open source text-to-speech for Wikipedia
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-16
J. Andersson, S. Berlin, André Costa, Harald Berthelsen, Hanna Lindgren, N. Lindberg, J. Beskow, Jens Edlund, Joakim Gustafson
{"title":"WikiSpeech - enabling open source text-to-speech for Wikipedia","authors":"J. Andersson, S. Berlin, André Costa, Harald Berthelsen, Hanna Lindgren, N. Lindberg, J. Beskow, Jens Edlund, Joakim Gustafson","doi":"10.21437/SSW.2016-16","DOIUrl":"https://doi.org/10.21437/SSW.2016-16","url":null,"abstract":"We present WikiSpeech, an ambitious joint project aiming to (1) make open source text-to-speech available through Wikimedia Foundation’s server architecture; (2) utilize the large and active Wikipedia user base to achieve continuously improving text-to-speech; (3) improve existing and develop new crowdsourcing methods for text-to-speech; and (4) develop new and adapt current evaluation methods so that they are well suited for the particular use case of reading Wikipedia articles out loud while at the same time capable of harnessing the huge user base made available by Wikipedia. At its inauguration, the project is backed by The Swedish Post and Telecom Authority and headed by Wikimedia Sverige, STTS and KTH, but in the long run, the project aims at broad multinational involvement. The vision of the project is freely available text-to-speech for all Wikipedia languages (currently 293). In this paper, we present the project itself and its first steps: requirements, initial architecture, and initial steps to include crowdsourcing and evaluation.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123949288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Open-Source Consumer-Grade Indic Text To Speech
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-31
Andrew Wilkinson, A. Parlikar, Sunayana Sitaram, Tim White, A. Black, Suresh Bazaj
{"title":"Open-Source Consumer-Grade Indic Text To Speech","authors":"Andrew Wilkinson, A. Parlikar, Sunayana Sitaram, Tim White, A. Black, Suresh Bazaj","doi":"10.21437/SSW.2016-31","DOIUrl":"https://doi.org/10.21437/SSW.2016-31","url":null,"abstract":"Open-source text-to-speech (TTS) software has enabled the development of voices in multiple languages, including many high-resource languages, such as English and European languages. However, building voices for low-resource languages is still challenging. We describe the development of TTS systems for 12 Indian languages using the Festvox framework, for which we developed a common frontend for Indian languages. Voices for eight of these 12 languages are available for use with Flite, a lightweight, fast run-time synthesizer, and the Android Flite app available in the Google Play store. Recently, the baseline Punjabi TTS voice was built end-to-end in a month by two undergraduate students (without any prior knowledge of TTS) with help from two of the authors of this paper. The framework can be used to build a baseline Indic TTS voice in two weeks, once a text corpus is selected and a suitable native speaker is identified.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114595027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Prosodic and Spectral iVectors for Expressive Speech Synthesis
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-10
Igor Jauk, A. Bonafonte
{"title":"Prosodic and Spectral iVectors for Expressive Speech Synthesis","authors":"Igor Jauk, A. Bonafonte","doi":"10.21437/SSW.2016-10","DOIUrl":"https://doi.org/10.21437/SSW.2016-10","url":null,"abstract":"This work presents a study on the suitability of prosodic andacoustic features, with a special focus on i-vectors, in expressivespeech analysis and synthesis. For each utterance of two dif-ferent databases, a laboratory recorded emotional acted speech,and an audiobook, several prosodic and acoustic features are ex-tracted. Among them, i-vectors are built not only on the MFCCbase, but also on F0, power and syllable durations. Then, un-supervised clustering is performed using different feature com-binations. The resulting clusters are evaluated calculating clus-ter entropy for labeled portions of the databases. Additionally,synthetic voices are trained, applying speaker adaptive training,from the clusters built from the audiobook. The voices are eval-uated in a perceptual test where the participants have to edit anaudiobook paragraph using the synthetic voices.The objective results suggest that i-vectors are very use-ful for the audiobook, where different speakers (book charac-ters) are imitated. On the other hand, for the laboratory record-ings, traditional prosodic features outperform i-vectors. Also,a closer analysis of the created clusters suggest that differentspeakers use different prosodic and acoustic means to conveyemotions. The perceptual results suggest that the proposed i-vector based feature combinations can be used for audiobookclustering and voice training.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130595296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
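The cluster-entropy score used above can be read as the average label impurity of the clusters. Since the abstract does not give the exact formula, the following numpy sketch uses one standard, size-weighted definition as an assumption:

```python
import numpy as np
from collections import Counter

def cluster_entropy(cluster_ids, labels):
    # Size-weighted average of the label entropy inside each cluster;
    # 0.0 means every cluster holds a single emotion/speaker label.
    total = len(labels)
    ent = 0.0
    for c in set(cluster_ids):
        members = [lab for cid, lab in zip(cluster_ids, labels) if cid == c]
        p = np.array(list(Counter(members).values()), dtype=float) / len(members)
        ent += (len(members) / total) * float(-(p * np.log2(p)).sum())
    return ent

# A clustering that separates two emotions perfectly scores 0.0:
print(cluster_entropy([0, 0, 1, 1], ["angry", "angry", "neutral", "neutral"]))
```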
Nonaudible murmur enhancement based on statistical voice conversion and noise suppression with external noise monitoring
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-9
Y. Tajiri, T. Toda
{"title":"Nonaudible murmur enhancement based on statistical voice conversion and noise suppression with external noise monitoring","authors":"Y. Tajiri, T. Toda","doi":"10.21437/SSW.2016-9","DOIUrl":"https://doi.org/10.21437/SSW.2016-9","url":null,"abstract":"This paper presents a method for making nonaudible murmur (NAM) enhancement based on statistical voice conversion (VC) robust against external noise. NAM, which is an extremely soft whispered voice, is a promising medium for silent speech communication thanks to its faint volume. Although such a soft voice can still be detected with a special body-conductive microphone, its quality significantly degrades compared to that of air-conductive voices. It has been shown that the statistical VC technique is capable of significantly improving quality of NAM by converting it into the air-conductive voices. However, this technique is not helpful under noisy conditions because a detected NAM signal easily suffers from external noise, and acoustic mismatches are caused between such a noisy NAM signal and a previously trained conversion model. To address this issue, in this paper we apply our proposed noise suppression method based on external noise monitoring to the statistical NAM enhancement. Moreover, a known noise superimposition method is further applied in order to alleviate the effects of residual noise components on the conversion accuracy. The experimental results demonstrate that the proposed method yields significant improvements in the conversion accuracy compared to the conventional method.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114400296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
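The suppression step above relies on a second microphone that monitors the external noise directly. As a much-simplified illustration of that idea, a magnitude-domain spectral subtraction using the monitored noise estimate might look as follows; the authors' actual method is more elaborate, so treat this purely as a sketch:

```python
import numpy as np

def suppress(noisy_mag, noise_mag, over=1.0, floor=0.05):
    # Subtract the externally monitored noise magnitude spectrum from
    # the noisy NAM spectrum, then floor the result so magnitudes never
    # go negative (too low a floor causes audible "musical noise").
    clean = noisy_mag - over * noise_mag
    return np.maximum(clean, floor * noisy_mag)

# One STFT frame (magnitudes only) from the body-conductive NAM
# microphone and the external noise-monitoring microphone:
noisy = np.abs(np.random.randn(257))
noise = 0.3 * np.abs(np.random.randn(257))
enhanced = suppress(noisy, noise)
```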
Jerk Minimization for Acoustic-To-Articulatory Inversion
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-14
Avni Rajpal, H. Patil
{"title":"Jerk Minimization for Acoustic-To-Articulatory Inversion","authors":"Avni Rajpal, H. Patil","doi":"10.21437/SSW.2016-14","DOIUrl":"https://doi.org/10.21437/SSW.2016-14","url":null,"abstract":"The effortless speech production in humans requires coordinated movements of the articulators such as lips, tongue, jaw, velum, etc. Therefore, measured trajectories obtained are smooth and slowly-varying. However, the trajectories estimated from acoustic-to-articulatory inversion (AAI) are found to be jagged . Thus, energy minimization is used as smoothness constraint for improving performance of the AAI. Besides energy minimization, jerk (i.e., rate of change of acceleration) is known for quantification of smoothness in case of human motor movements. Human motors are organized to achieve intended goal with smoothest possible movements, under the constraint of minimum accelerative transients. In this paper, we propose jerk minimization as an alternative smoothness criterion for frame-based acoustic-to-articulatory inversion. The resultant trajectories obtained are smooth in the sense that for articulator-specific window size, they will have minimum jerk. The results using this criterion were found to be comparable with inversion schemes based on existing energy minimization criteria for achieving smoothness.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117138581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
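Jerk is the third time-derivative of position, so one generic way to impose the criterion is to penalize third differences of the estimated trajectory. The closed-form smoother below is a sketch of that idea (the paper's windowed, articulator-specific formulation may differ):

```python
import numpy as np

def min_jerk_smooth(x, lam=50.0):
    # Solve min_y ||y - x||^2 + lam * ||D3 y||^2 in closed form, where
    # D3 is the third-order finite-difference operator (discrete jerk).
    n = len(x)
    D3 = np.zeros((n - 3, n))
    for i in range(n - 3):
        D3[i, i:i + 4] = [-1.0, 3.0, -3.0, 1.0]
    return np.linalg.solve(np.eye(n) + lam * D3.T @ D3, x)

t = np.linspace(0.0, 1.0, 100)
jagged = np.sin(2 * np.pi * t) + 0.1 * np.random.randn(100)  # AAI-like output
smooth = min_jerk_smooth(jagged)  # same trajectory with minimal jerk energy
```

Larger lam trades fidelity to the raw AAI output for smoother, more physiologically plausible movement.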
Multi-output RNN-LSTM for multiple speaker speech synthesis with α-interpolation model
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-19
Santiago Pascual, A. Bonafonte
{"title":"Multi-output RNN-LSTM for multiple speaker speech synthesis with α-interpolation model","authors":"Santiago Pascual, A. Bonafonte","doi":"10.21437/SSW.2016-19","DOIUrl":"https://doi.org/10.21437/SSW.2016-19","url":null,"abstract":"Deep Learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments prove that this scheme produces much better results in comparison with sin- \u0000gle speaker model. Moreover, we also tackle the problem of speaker interpolation by adding a new output layer (a-layer) on top of the multi-output branches. An identifying code is injected into the layer together with acoustic features of many speakers. Experiments show that the a-layer can effectively learn to interpolate the acoustic features between speakers.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127704141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
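A loose PyTorch reconstruction of the architecture sketched in the abstract: a recurrent trunk shared by all speakers, one linear output head per speaker, and interpolation realized here as a weighted mix of head outputs. The layer sizes and the fixed mixing weights are assumptions; the paper's α-layer is itself trained, with a speaker-identifying code injected alongside the acoustic features:

```python
import torch
import torch.nn as nn

class MultiSpeakerSynth(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, n_speakers):
        super().__init__()
        # hidden layers shared across all speakers
        self.trunk = nn.LSTM(in_dim, hid_dim, num_layers=2, batch_first=True)
        # one speaker-specific output layer per speaker
        self.heads = nn.ModuleList(
            nn.Linear(hid_dim, out_dim) for _ in range(n_speakers))

    def forward(self, x, alpha):
        # x: (batch, time, in_dim); alpha: (n_speakers,) mixing weights
        h, _ = self.trunk(x)
        outs = torch.stack([head(h) for head in self.heads])  # (S, B, T, D)
        return (alpha.view(-1, 1, 1, 1) * outs).sum(dim=0)    # (B, T, D)

model = MultiSpeakerSynth(in_dim=60, hid_dim=256, out_dim=187, n_speakers=4)
x = torch.randn(2, 50, 60)                  # linguistic input features
alpha = torch.tensor([0.5, 0.5, 0.0, 0.0])  # blend speakers 0 and 1
y = model(x, alpha)                         # interpolated acoustic features
```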
Non-intrusive Quality Assessment of Synthesized Speech using Spectral Features and Support Vector Regression
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-21
Meet H. Soni, H. Patil
{"title":"Non-intrusive Quality Assessment of Synthesized Speech using Spectral Features and Support Vector Regression","authors":"Meet H. Soni, H. Patil","doi":"10.21437/SSW.2016-21","DOIUrl":"https://doi.org/10.21437/SSW.2016-21","url":null,"abstract":"In this paper, we propose a new quality assessment method for synthesized speech. Unlike previous approaches which uses Hidden Markov Model (HMM) trained on natural utterances as a reference model to predict the quality of synthesized speech, proposed approach uses knowledge about synthesized speech while training the model. The previous approach has been successfully applied in the quality assessment of synthesized speech for the German language. However, it gave poor results for English language databases such as Blizzard Challenge 2008 and 2009 databases. The problem of quality assessment of synthesized speech is posed as a regression problem. The mapping between statistical properties of spectral features extracted from the speech signal and corresponding speech quality score (MOS) was found using Support Vector Regression (SVR). All the experiments were done on Blizzard Challenge Databases of the year 2008, 2009, 2010 and 2012. The results of experiments show that by including knowledge about synthesized speech while training, the performance of quality assessment system can be improved. Moreover, the accuracy of quality assessment system heavily depends on the kind of synthesis system used for signal generation. On Blizzard 2008 and 2009 database, proposed approach gives correlation of 0.28 and 0.49 , respectively, for about 17 % data used in training. Previous approach gives correlation of 0.3 and 0.09 , respectively, using spectral features. For Blizzard 2012 database, proposed approach gives correlation of 0.8 by using 12 % of available data in training.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116724346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
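The regression setup described above maps per-utterance statistics of spectral features to a MOS score. A minimal scikit-learn sketch of that pipeline, with random arrays standing in for the Blizzard feature extraction (the feature dimensionality and SVR hyperparameters here are placeholders):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Each row: statistics (e.g., means/variances of spectral features)
# pooled over one synthesized utterance; each target: its listener MOS.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 24))
mos = rng.uniform(1.0, 5.0, size=200)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X, mos)
predicted_mos = model.predict(X[:5])  # non-intrusive: no reference signal needed
```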
Mandarin Prosodic Phrase Prediction based on Syntactic Trees
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-26
Zhengchen Zhang, Fuxiang Wu, Chenyu Yang, M. Dong, Fu-qiu Zhou
{"title":"Mandarin Prosodic Phrase Prediction based on Syntactic Trees","authors":"Zhengchen Zhang, Fuxiang Wu, Chenyu Yang, M. Dong, Fu-qiu Zhou","doi":"10.21437/SSW.2016-26","DOIUrl":"https://doi.org/10.21437/SSW.2016-26","url":null,"abstract":"Prosodic phrases (PPs) are important for Mandarin Text-To-Speech systems. Most of the existing PP detection methods need large manually annotated corpora to learn the models. In this paper, we propose a rule based method to predict the PP boundaries employing the syntactic information of a sentence. The method is based on the ob-servation that a prosodic phrase is a meaningful segment of a sentence with length restrictions. A syntactic structure allows to segment a sentence according to grammars. We add some length restrictions to the segmentations to predict the PP boundaries. An F-Score of 0.693 was obtained in the experiments, which is about 0.02 higher than the one got by a Conditional Random Field based method.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126297231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
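To make the length-restricted segmentation idea concrete, the sketch below descends a constituency tree top-down and keeps any constituent short enough to serve as one prosodic phrase, splitting longer ones at their children. The tree encoding, the length threshold, and the single rule are illustrative assumptions, not the paper's actual rule set:

```python
MAX_LEN = 9  # assumed maximum character length of one prosodic phrase

def flatten(node):
    # A node is either a token string or a (label, children) tuple.
    if isinstance(node, str):
        return [node]
    return [tok for child in node[1] for tok in flatten(child)]

def segment(node):
    # Keep a constituent as one phrase if its text fits the length
    # limit; otherwise split at its child constituents and recurse.
    text = "".join(flatten(node))
    if isinstance(node, str) or len(text) <= MAX_LEN:
        return [text]
    return [p for child in node[1] for p in segment(child)]

tree = ("IP", [("NP", ["我们"]),
               ("VP", [("VV", ["提出了"]), ("NP", ["一种新方法"])])])
print(segment(tree))  # ['我们', '提出了一种新方法']
```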
Automatic, model-based detection of pause-less phrase boundaries from fundamental frequency and duration features
Speech Synthesis Workshop Pub Date: 2016-09-13 DOI: 10.21437/SSW.2016-1
Mahsa Sadat Elyasi Langarani, J. V. Santen
{"title":"Automatic, model-based detection of pause-less phrase boundaries from fundamental frequency and duration features","authors":"Mahsa Sadat Elyasi Langarani, J. V. Santen","doi":"10.21437/SSW.2016-1","DOIUrl":"https://doi.org/10.21437/SSW.2016-1","url":null,"abstract":"Prosodic phrase boundaries (PBs) are a key aspect of spoken communication. In automatic PB detection, it is common to use local acoustic features, textual features, or a combination of both. Most approaches – regardless of features used – succeed in detecting major PBs (break score “4” in ToBI annotation, typically involving a pause) while detection of intermediate PBs (break score “3” in ToBI annotation) is still challenging. In this study we investigate the detection of intermediate, “pause-less” PBs using prosodic models, using a new corpus character-ized by strong prosodic dynamics and an existing (CMU) corpus. We show how using duration and fundamental frequency modeling can improve detection of these PBs, as measured by the F1 score, compared to Festival, which only uses textual features to detect PBs. We believe that this study contributes to our understanding of the prosody of phrase breaks.","PeriodicalId":340820,"journal":{"name":"Speech Synthesis Workshop","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116566992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
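Two of the classic prosodic cues behind such models are the F0 reset across a boundary and pre-boundary syllable lengthening. The helper below computes both from raw measurements; it is a generic illustration of the feature side, not the paper's actual duration and F0 models:

```python
import numpy as np

def boundary_features(f0_before, f0_after, final_syll_dur, mean_syll_dur):
    # F0 reset: pitch typically declines through a phrase and jumps
    # back up after a boundary, even when no pause is present.
    f0_reset = float(np.median(f0_after) - np.median(f0_before))
    # Pre-boundary lengthening: the final syllable before a boundary
    # is stretched relative to the speaker's average syllable duration.
    lengthening = final_syll_dur / mean_syll_dur
    return f0_reset, lengthening

# Declining F0 (Hz) before a candidate boundary, reset afterwards:
print(boundary_features([180, 172, 165], [210, 205, 198], 0.31, 0.18))
```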