Title: FiPPiE: A Computationally Efficient Differentiable Method for Estimating Fundamental Frequency From Spectrograms
Authors: L. Finkelstein, Chun-an Chan, Vincent Wan, H. Zen, Rob Clark
DOI: https://doi.org/10.21437/ssw.2023-34
Published in: 12th ISCA Speech Synthesis Workshop (SSW2023), 2023-08-26
Abstract: In this paper we present FiPPiE, a Filter-Inferred Pitch Posteriorgram Estimator, a method for estimating fundamental frequency from spectrograms, either linear or mel, by applying a special kind of filter in the spectral domain. Unlike other works in this field, we developed a procedure for training an optimized filter (or kernel) for this type of estimation. FiPPiE, based on this optimized filter, has proven to be a reliable fundamental frequency estimator that is computationally efficient, differentiable, and easy to implement. We demonstrate the performance of the method both through analysis of its behavior on human recordings and through stability analysis with the help of an automated system.

Title: Voice Cloning: Training Speaker Selection with Limited Multi-Speaker Corpus
Authors: David Guennec, Lily Wadoux, A. Sini, N. Barbot, Damien Lolive
DOI: https://doi.org/10.21437/ssw.2023-27
Published in: 12th ISCA Speech Synthesis Workshop (SSW2023), 2023-08-26
Abstract: Text-to-speech synthesis with little data is a challenging task, in particular when choosing the target speaker is not an option. Voice cloning is a popular method to alleviate these issues using only a few minutes of target speech. To do this, the model must first be trained on a large corpus of thousands of hours and hundreds of speakers. In this paper, we tackle the challenge of cloning voices with a much smaller corpus, using both the speaker adaptation and speaker encoding methods. We study the impact of selecting our training speakers based on their similarity to the targets. We train models using only the training speakers closest/farthest to our targets in terms of speaker similarity from a pool of 14 speakers. We show that the selection of speakers in the training set has an impact on the similarity to the target speaker. The effect is more prominent for speaker encoding than for adaptation. However, it remains nuanced when it comes to naturalness.

{"title":"Local Style Tokens: Fine-Grained Prosodic Representations For TTS Expressive Control","authors":"Martin Lenglet, O. Perrotin, G. Bailly","doi":"10.21437/ssw.2023-19","DOIUrl":"https://doi.org/10.21437/ssw.2023-19","url":null,"abstract":"Neural Text-To-Speech (TTS) models achieve great performances regarding naturalness, but modeling expressivity remains an ongoing challenge. Some success was found through implicit approaches like Global Style Tokens (GST), but these methods model speech style at utterance-level. In this paper, we propose to add an auxiliary module called Local Style To-kens (LST) in the encoder-decoder pipeline to model local variations in prosody. This module can implement various scales of representations; we chose Word-level and Phoneme-level prosodic representations to assess the capabilities of the proposed module to better model sub-utterance style variations. Objective evaluation of the synthetic speech shows that LST modules better capture prosodic variations on 12 common styles compared to a GST baseline. These results were validated by participants during listening tests.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129257215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Pause-Internal Phonetic Particles on Recall in Synthesized Lectures","authors":"Mikey Elmers, Éva Székely","doi":"10.21437/ssw.2023-32","DOIUrl":"https://doi.org/10.21437/ssw.2023-32","url":null,"abstract":"We studied the effect of pause-internal phonetic particles (PINTs) on recall for native and non-native listeners of English in a listening experiment with synthesized material that simulated a university lecture. Using a neural speech synthesizer trained on recorded lectures with PINTs annotations, we generated three distinct conditions: a base version, a “silence” version where non-silence PINTs were replaced with silence, and a “nopints” version where all PINTs, including silences, were removed. Half of the participants were informed they were listening to computer-generated audio, while the other half were told the audio was recorded with a poor-quality microphone. We found that neither the condition nor the participants’ native language significantly affected their overall score, and the presence of PINTs before critical information had a negative effect on recall. This study highlights the importance of considering PINTs for educational purposes in speech synthesis systems.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123757529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Subjective Evaluation of Text-to-Speech Models: Comparing Absolute Category Rating and Ranking by Elimination Tests
Authors: K. Lakshminarayana, C. Dittmar, N. Pia, Emanuël Habets
DOI: https://doi.org/10.21437/ssw.2023-30
Published in: 12th ISCA Speech Synthesis Workshop (SSW2023), 2023-08-26
Abstract: Modern text-to-speech (TTS) models are typically subjectively evaluated using an Absolute Category Rating (ACR) method. This method uses the mean opinion score to rate each model under test. However, if the models are perceptually too similar, assigning absolute ratings to stimuli might be difficult and prone to subjective preference errors. Pairwise comparison tests offer relative comparison and capture some of the subtle differences between the stimuli better. However, pairwise comparisons take more time as the number of tests increases exponentially with the number of models. Alternatively, a ranking-by-elimination (RBE) test can assess multiple models with similar benefits as pairwise comparisons for subtle differences across models without the time penalty. We compared the ACR and RBE tests for TTS evaluation in a controlled experiment. We found that the obtained results were statistically similar even in the presence of perceptually close TTS models.

{"title":"Importance of Human Factors in Text-To-Speech Evaluations","authors":"L. Finkelstein, Joshua Camp, R. Clark","doi":"10.21437/ssw.2023-5","DOIUrl":"https://doi.org/10.21437/ssw.2023-5","url":null,"abstract":"Both mean opinion score (MOS) evaluations and preference tests in text-to-speech are often associated with high rating variance. In this paper we investigate two important factors that affect that variance. One factor is that the variance is coming from how raters are picked for a specific test, and another is the dynamic behavior of individual raters across time. This paper increases the awareness of these issues when designing an evaluation experiment, since the standard confidence interval on the test level cannot incorporate the variance associated with these two factors. We show the impact of the two sources of variance and how they can be mitigated. We demonstrate that simple improvements in experiment design such as using a smaller number of rating tasks per rater can significantly improve the experiment confidence intervals / reproducibility with no extra cost.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130086015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Situating Speech Synthesis: Investigating Contextual Factors in the Evaluation of Conversational TTS","authors":"Harm Lameris, Ambika Kirkland, Joakim Gustafson, Éva Székely","doi":"10.21437/ssw.2023-11","DOIUrl":"https://doi.org/10.21437/ssw.2023-11","url":null,"abstract":"Speech synthesis evaluation methods have lagged behind the development of TTS systems, with single sentence read-speech MOS naturalness evaluation on crowdsourcing platforms being the industry standard. For TTS to successfully be applied in social contexts, evaluation methods need to be socially embedded in the situation where they will be deployed. Due to the time and cost constraints of conducting an in-person interaction evaluation for TTS, we examine the effect of introducing situational context and preceding sentence context to participants in a subjective listening experiment. We conduct a suitability evaluation for a robot game guide that explains game rules to participants using two synthesized spontaneous voices: an instruction-specific and a general spontaneous voice. Results indicate that the inclusion of context influences user ratings, highlighting the need for context-aware evaluations. However, the type of context did not significantly affect the results.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"12 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129117551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diffusion Transformer for Adaptive Text-to-Speech","authors":"Haolin Chen, Philip N. Garner","doi":"10.21437/ssw.2023-25","DOIUrl":"https://doi.org/10.21437/ssw.2023-25","url":null,"abstract":"Given the success of diffusion in synthesizing realistic speech, we investigate how diffusion can be included in adaptive text-to-speech systems. Inspired by the adaptable layer norm modules for Transformer, we adapt a new backbone of diffusion models, Diffusion Transformer, for acoustic modeling. Specifically, the adaptive layer norm in the architecture is used to condition the diffusion network on text representations, which further enables parameter-efficient adaptation. We show the new architecture to be a faster alternative to its convolutional counterpart for general text-to-speech, while demonstrating a clear advantage on naturalness and similarity over the Transformer for few-shot and few-parameter adaptation. In the zero-shot scenario, while the new backbone is a decent alternative, the main benefit of such an architecture is to enable high-quality parameter-efficient adaptation when finetuning is performed.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121234668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Augmentation Methods on Ultrasound Tongue Images for Articulation-to-Speech Synthesis","authors":"I. Ibrahimov, G. Gosztolya, T. Csapó","doi":"10.21437/ssw.2023-36","DOIUrl":"https://doi.org/10.21437/ssw.2023-36","url":null,"abstract":"Articulation-to-Speech Synthesis (ATS) focuses on converting articulatory biosignal information into audible speech, nowadays mostly using DNNs, with a future target application of a Silent Speech Interface. Ultrasound Tongue Imaging (UTI) is an affordable and non-invasive technique that has become popular for collecting articulatory data. Data augmentation has been shown to improve the generalization ability of DNNs, e.g. to avoid overfitting, introduce variations into the existing dataset, or make the network more robust against various noise types on the input data. In this paper, we compare six different data augmentation methods on the UltraSuite-TaL corpus during UTI-based ATS using CNNs. Validation mean squared error is used to evaluate the performance of CNNs, while by the synthesized speech samples, the performace of direct ATS is measured us-ing MCD and PESQ scores. Although we did not find large differences in the outcome of various data augmentation techniques, the results of this study suggest that while applying data augmentation techniques on UTI poses some challenges due to the unique nature of the data, it provides benefits in terms of enhancing the robustness of neural networks. In general, articulatory control might be beneficial in TTS as well.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130703672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-lingual transfer using phonological features for resource-scarce text-to-speech","authors":"J. A. Louw","doi":"10.21437/ssw.2023-9","DOIUrl":"https://doi.org/10.21437/ssw.2023-9","url":null,"abstract":"In this work, we explore the use of phonological features in cross-lingual transfer within resource-scarce settings. We modify the architecture of VITS to accept a phonological feature vector as input, instead of phonemes or characters. Subsequently, we train multispeaker base models using data from LibriTTS and then fine-tune them on single-speaker Afrikaans and isiXhosa datasets of varying sizes, representing the resourcescarce setting. We evaluate the synthetic speech both objectively and subjectively and compare it to models trained with the same data using the standard VITS architecture. In our experiments, the proposed system utilizing phonological features as input converges significantly faster and requires less data than the base system. We demonstrate that the model employing phonological features is capable of producing sounds in the target language that were unseen in the source language, even in languages with significant linguistic differences, and with only 5 minutes of data in the target language.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132928318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}