12th ISCA Speech Synthesis Workshop (SSW2023): Latest Publications

FiPPiE: A Computationally Efficient Differentiable method for Estimating Fundamental Frequency From Spectrograms
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-34
L. Finkelstein, Chun-an Chan, Vincent Wan, H. Zen, Rob Clark
Abstract: In this paper we present FiPPiE, a Filter-Inferred Pitch Posteriorgram Estimator: a method of estimating fundamental frequency from spectrograms, either linear or mel, by applying a special kind of filter in the spectral domain. Unlike other works in this field, we developed a procedure for training an optimized filter (or kernel) for this type of estimation. FiPPiE, based on this optimized filter, demonstrated itself as a reliable fundamental frequency estimator that is computationally efficient, differentiable, and easily implementable. We demonstrate the performance of the method both by analysis of its behavior on human recordings and by stability analysis with the help of an automated system.
Citations: 0
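The paper's trained kernel is not reproduced here, but the core operation it describes (sliding a filter along the frequency axis of a spectrogram and applying a softmax to obtain a differentiable posteriorgram over F0 candidates) can be sketched as follows. The function name, shapes, and toy kernel are illustrative assumptions, not the authors' code:

```python
import numpy as np

def pitch_posteriorgram(spec, kernel):
    """Slide a 1-D kernel along the frequency axis of each frame and
    softmax the responses into a per-frame distribution over F0 bins.
    spec: (frames, freq_bins); kernel: (k,) with k <= freq_bins."""
    frames, n_bins = spec.shape
    k = len(kernel)
    n_out = n_bins - k + 1
    scores = np.empty((frames, n_out))
    for i in range(n_out):
        scores[:, i] = spec[:, i:i + k] @ kernel  # correlation at each candidate bin
    # softmax over candidate bins: differentiable, sums to 1 per frame
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# toy frame with all energy at frequency bin 10
spec = np.zeros((1, 32))
spec[0, 10] = 1.0
post = pitch_posteriorgram(spec, np.array([0.25, 0.5, 0.25]))
print(post.shape)             # (1, 30)
print(int(post[0].argmax()))  # 9: the kernel center aligns with bin 10
```

Because the whole pipeline is matrix products and a softmax, gradients flow through it, which is the property the paper emphasizes.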
Voice Cloning: Training Speaker Selection with Limited Multi-Speaker Corpus
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-27
David Guennec, Lily Wadoux, A. Sini, N. Barbot, Damien Lolive
Abstract: Text-to-speech synthesis with few data is a challenging task, in particular when choosing the target speaker is not an option. Voice cloning is a popular method to alleviate these issues using only a few minutes of target speech. To do this, the model must first be trained on a large corpus of thousands of hours and hundreds of speakers. In this paper, we tackle the challenge of cloning voices with a much smaller corpus, using both the speaker adaptation and speaker encoding methods. We study the impact of selecting our training speakers based on their similarity to the targets. We train models using only the training speakers closest/farthest to our targets in terms of speaker similarity, from a pool of 14 speakers. We show that the selection of speakers in the training set has an impact on the similarity to the target speaker. The effect is more prominent for speaker encoding than for adaptation. However, it remains nuanced when it comes to naturalness.
Citations: 0
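The abstract describes ranking training speakers by similarity to the target. A minimal sketch of such a selection step, assuming cosine similarity over speaker embeddings (the names, vectors, and dimensionality below are invented for illustration):

```python
import numpy as np

def closest_speakers(target_emb, pool, k):
    """Rank pool speakers by cosine similarity to the target embedding
    and return the k closest names."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(pool, key=lambda name: cos(target_emb, pool[name]), reverse=True)
    return ranked[:k]

# toy 2-D embeddings; real speaker embeddings are much higher-dimensional
pool = {
    "spk1": np.array([1.0, 0.0]),
    "spk2": np.array([0.7, 0.7]),
    "spk3": np.array([0.0, 1.0]),
}
target = np.array([0.9, 0.1])
print(closest_speakers(target, pool, 2))  # ['spk1', 'spk2']
```

Selecting the farthest speakers, as in the paper's contrast condition, would simply reverse the sort.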
Local Style Tokens: Fine-Grained Prosodic Representations For TTS Expressive Control
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-19
Martin Lenglet, O. Perrotin, G. Bailly
Abstract: Neural text-to-speech (TTS) models achieve great performance in terms of naturalness, but modeling expressivity remains an ongoing challenge. Some success was found through implicit approaches such as Global Style Tokens (GST), but these methods model speech style at the utterance level. In this paper, we propose to add an auxiliary module called Local Style Tokens (LST) to the encoder-decoder pipeline to model local variations in prosody. This module can implement various scales of representation; we chose word-level and phoneme-level prosodic representations to assess the ability of the proposed module to better model sub-utterance style variations. Objective evaluation of the synthetic speech shows that LST modules capture prosodic variations on 12 common styles better than a GST baseline. These results were validated by participants during listening tests.
Citations: 0
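GST-style modules attend over a bank of learned style tokens; the local variant described here applies the same mechanism per word or phoneme rather than once per utterance. A rough sketch, assuming simple softmax attention (token count, dimensions, and the per-phoneme queries are arbitrary assumptions):

```python
import numpy as np

def style_embedding(query, tokens):
    """Attend over a bank of learned style tokens: softmax similarity
    weights, then a weighted sum. One utterance-level query gives a
    GST-like global style; one query per word/phoneme gives local styles."""
    scores = tokens @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ tokens

rng = np.random.default_rng(1)
tokens = rng.standard_normal((10, 16))           # 10 style tokens, dim 16
phoneme_queries = rng.standard_normal((5, 16))   # one query per phoneme
local_styles = np.stack([style_embedding(q, tokens) for q in phoneme_queries])
print(local_styles.shape)  # (5, 16): one style vector per phoneme
```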
The Impact of Pause-Internal Phonetic Particles on Recall in Synthesized Lectures
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-32
Mikey Elmers, Éva Székely
Abstract: We studied the effect of pause-internal phonetic particles (PINTs) on recall for native and non-native listeners of English in a listening experiment with synthesized material that simulated a university lecture. Using a neural speech synthesizer trained on recorded lectures with PINTs annotations, we generated three distinct conditions: a base version, a "silence" version where non-silence PINTs were replaced with silence, and a "nopints" version where all PINTs, including silences, were removed. Half of the participants were informed they were listening to computer-generated audio, while the other half were told the audio was recorded with a poor-quality microphone. We found that neither the condition nor the participants' native language significantly affected their overall score, and the presence of PINTs before critical information had a negative effect on recall. This study highlights the importance of considering PINTs for educational purposes in speech synthesis systems.
Citations: 0
Subjective Evaluation of Text-to-Speech Models: Comparing Absolute Category Rating and Ranking by Elimination Tests
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-30
K. Lakshminarayana, C. Dittmar, N. Pia, Emanuël Habets
Abstract: Modern text-to-speech (TTS) models are typically subjectively evaluated using an Absolute Category Rating (ACR) method. This method uses the mean opinion score to rate each model under test. However, if the models are perceptually too similar, assigning absolute ratings to stimuli might be difficult and prone to subjective preference errors. Pairwise comparison tests offer relative comparison and better capture some of the subtle differences between stimuli. However, pairwise comparisons take more time, as the number of tests grows quadratically with the number of models. Alternatively, a ranking-by-elimination (RBE) test can assess multiple models with similar benefits to pairwise comparisons for subtle differences across models, without the time penalty. We compared the ACR and RBE tests for TTS evaluation in a controlled experiment. We found that the obtained results were statistically similar, even in the presence of perceptually close TTS models.
Citations: 0
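The abstract does not spell out the RBE protocol; one plausible reading, in which a listener repeatedly eliminates the worst remaining stimulus until none are left, can be sketched as follows (the function name and the toy scores are assumptions, not the paper's procedure):

```python
def rank_by_elimination(stimuli, score_fn):
    """Repeatedly eliminate the worst remaining stimulus according to
    score_fn; returns the stimuli ordered from worst to best, i.e. in
    elimination order."""
    remaining = list(stimuli)
    order = []
    while remaining:
        worst = min(remaining, key=score_fn)
        remaining.remove(worst)
        order.append(worst)
    return order

# toy preference scores standing in for a listener's judgments
prefs = {"modelA": 3.2, "modelB": 4.1, "modelC": 2.8}
print(rank_by_elimination(prefs, prefs.get))  # ['modelC', 'modelA', 'modelB']
```

With n models this needs n elimination decisions per listener, versus n(n-1)/2 pairwise trials, which is the time saving the abstract alludes to.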
Importance of Human Factors in Text-To-Speech Evaluations
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-5
L. Finkelstein, Joshua Camp, R. Clark
Abstract: Both mean opinion score (MOS) evaluations and preference tests in text-to-speech are often associated with high rating variance. In this paper we investigate two important factors that affect that variance: one is the variance introduced by how raters are picked for a specific test, and the other is the dynamic behavior of individual raters across time. This paper raises awareness of these issues when designing an evaluation experiment, since the standard test-level confidence interval cannot incorporate the variance associated with these two factors. We show the impact of the two sources of variance and how they can be mitigated. We demonstrate that simple improvements in experiment design, such as using a smaller number of rating tasks per rater, can significantly improve experiment confidence intervals and reproducibility at no extra cost.
Citations: 0
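The rater-selection effect the paper describes can be illustrated with a toy simulation: if each rater carries a personal bias, a test mean computed from a few raters doing many tasks each inherits more variance across panel draws than one computed from many raters doing few tasks each, at the same total number of ratings. All numbers below are invented for illustration:

```python
import random
import statistics

def simulate_test_mean(n_raters, tasks_per_rater, rng):
    """Each rater has a personal bias; the test-level mean inherits
    variance from which raters happened to be picked for the panel."""
    scores = []
    for _ in range(n_raters):
        bias = rng.gauss(0, 0.5)               # rater-specific offset
        for _ in range(tasks_per_rater):
            scores.append(4.0 + bias + rng.gauss(0, 0.3))  # per-task noise
    return statistics.mean(scores)

rng = random.Random(0)
# same total budget of 120 ratings, spread over few vs. many raters
few  = [simulate_test_mean(6, 20, rng) for _ in range(200)]
many = [simulate_test_mean(40, 3, rng) for _ in range(200)]
print(statistics.stdev(few) > statistics.stdev(many))  # True
```

The rater bias only averages out across raters, not across repeated tasks by the same rater, which is why fewer tasks per rater (and hence more raters per test) tightens the test-level spread.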
Situating Speech Synthesis: Investigating Contextual Factors in the Evaluation of Conversational TTS
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-11
Harm Lameris, Ambika Kirkland, Joakim Gustafson, Éva Székely
Abstract: Speech synthesis evaluation methods have lagged behind the development of TTS systems, with single-sentence read-speech MOS naturalness evaluation on crowdsourcing platforms being the industry standard. For TTS to be applied successfully in social contexts, evaluation methods need to be socially embedded in the situation where they will be deployed. Due to the time and cost constraints of conducting an in-person interaction evaluation for TTS, we examine the effect of introducing situational context and preceding-sentence context to participants in a subjective listening experiment. We conduct a suitability evaluation for a robot game guide that explains game rules to participants using two synthesized spontaneous voices: an instruction-specific voice and a general spontaneous voice. Results indicate that the inclusion of context influences user ratings, highlighting the need for context-aware evaluations. However, the type of context did not significantly affect the results.
Citations: 0
Diffusion Transformer for Adaptive Text-to-Speech
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-25
Haolin Chen, Philip N. Garner
Abstract: Given the success of diffusion in synthesizing realistic speech, we investigate how diffusion can be included in adaptive text-to-speech systems. Inspired by the adaptable layer norm modules for the Transformer, we adapt a new backbone of diffusion models, the Diffusion Transformer, for acoustic modeling. Specifically, the adaptive layer norm in the architecture is used to condition the diffusion network on text representations, which further enables parameter-efficient adaptation. We show the new architecture to be a faster alternative to its convolutional counterpart for general text-to-speech, while demonstrating a clear advantage in naturalness and similarity over the Transformer for few-shot and few-parameter adaptation. In the zero-shot scenario, while the new backbone is a decent alternative, the main benefit of such an architecture is to enable high-quality parameter-efficient adaptation when fine-tuning is performed.
Citations: 1
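The adaptive layer norm conditioning referred to here can be sketched roughly as follows: the scale and shift of the normalization are predicted from a conditioning vector instead of being stored as learned constants. The shapes and the (1 + gamma) modulation form are assumptions based on common adaLN variants, not necessarily the paper's exact formulation:

```python
import numpy as np

def adaptive_layer_norm(x, cond, w_scale, w_shift, eps=1e-5):
    """LayerNorm whose per-channel scale and shift are predicted from a
    conditioning vector (e.g. a text or timestep embedding)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mu) / np.sqrt(var + eps)
    gamma = cond @ w_scale   # (d,) modulation predicted from the condition
    beta = cond @ w_shift
    return normed * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
d, c = 8, 4                              # feature dim, condition dim
x = rng.standard_normal((2, d))          # two tokens
cond = rng.standard_normal(c)            # conditioning vector
out = adaptive_layer_norm(x, cond,
                          rng.standard_normal((c, d)) * 0.1,
                          rng.standard_normal((c, d)) * 0.1)
print(out.shape)  # (2, 8)
```

Because only the small projection matrices carry the condition, fine-tuning just those matrices gives the parameter-efficient adaptation the abstract mentions.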
Data Augmentation Methods on Ultrasound Tongue Images for Articulation-to-Speech Synthesis
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-36
I. Ibrahimov, G. Gosztolya, T. Csapó
Abstract: Articulation-to-Speech Synthesis (ATS) focuses on converting articulatory biosignal information into audible speech, nowadays mostly using DNNs, with a future target application of a silent speech interface. Ultrasound Tongue Imaging (UTI) is an affordable and non-invasive technique that has become popular for collecting articulatory data. Data augmentation has been shown to improve the generalization ability of DNNs, e.g. to avoid overfitting, introduce variation into the existing dataset, or make the network more robust against various types of noise in the input data. In this paper, we compare six different data augmentation methods on the UltraSuite-TaL corpus during UTI-based ATS using CNNs. Validation mean squared error is used to evaluate the performance of the CNNs, while the performance of direct ATS is measured on the synthesized speech samples using MCD and PESQ scores. Although we did not find large differences in the outcomes of the various data augmentation techniques, the results of this study suggest that while applying data augmentation to UTI poses some challenges due to the unique nature of the data, it provides benefits in terms of enhancing the robustness of neural networks. In general, articulatory control might be beneficial in TTS as well.
Citations: 0
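The six augmentation methods compared in the paper are not listed in the abstract. Purely as an illustration, two generic image augmentations that could plausibly be applied to ultrasound frames (additive noise and a small spatial shift; parameters are invented) might look like:

```python
import numpy as np

def augment(img, rng):
    """Two cheap augmentations that preserve overall tongue-contour
    structure: additive Gaussian noise and a small horizontal shift."""
    noisy = img + rng.normal(0.0, 0.01, img.shape)
    shift = int(rng.integers(-2, 3))      # shift by -2..2 columns
    shifted = np.roll(noisy, shift, axis=1)
    return np.clip(shifted, 0.0, 1.0)     # keep valid pixel range

rng = np.random.default_rng(42)
frame = rng.random((64, 128))             # one toy ultrasound frame
aug = augment(frame, rng)
print(aug.shape == frame.shape)  # True
```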
Cross-lingual transfer using phonological features for resource-scarce text-to-speech
12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-9
J. A. Louw
Abstract: In this work, we explore the use of phonological features in cross-lingual transfer within resource-scarce settings. We modify the architecture of VITS to accept a phonological feature vector as input, instead of phonemes or characters. Subsequently, we train multi-speaker base models using data from LibriTTS and then fine-tune them on single-speaker Afrikaans and isiXhosa datasets of varying sizes, representing the resource-scarce setting. We evaluate the synthetic speech both objectively and subjectively and compare it to models trained on the same data using the standard VITS architecture. In our experiments, the proposed system utilizing phonological features as input converges significantly faster and requires less data than the base system. We demonstrate that the model employing phonological features is capable of producing sounds in the target language that were unseen in the source language, even in languages with significant linguistic differences, and with only 5 minutes of data in the target language.
Citations: 0
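The idea of replacing phoneme symbols with phonological feature vectors can be sketched as follows. An unseen target-language phoneme decomposes into articulatory features the model already saw during source-language training, which is what makes the cross-lingual transfer possible. The feature inventory and phoneme table here are simplified illustrations, not the paper's actual feature set:

```python
# A fixed ordering of articulatory features defines the input vector.
FEATURES = ["voiced", "nasal", "plosive", "fricative", "labial", "alveolar", "velar"]

# Each phoneme is a bundle of active features (toy subset).
PHONEMES = {
    "p": {"plosive", "labial"},
    "b": {"voiced", "plosive", "labial"},
    "m": {"voiced", "nasal", "labial"},
    "t": {"plosive", "alveolar"},
}

def feature_vector(phoneme):
    """Binary vector over FEATURES, fed to the model instead of a
    phoneme symbol embedding."""
    active = PHONEMES[phoneme]
    return [1 if f in active else 0 for f in FEATURES]

print(feature_vector("b"))  # [1, 0, 1, 0, 1, 0, 0]
```

A target-language phoneme absent from the source inventory still maps to a vector in this same feature space, so the model needs no new input symbols at fine-tuning time.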