12th ISCA Speech Synthesis Workshop (SSW2023): Latest Publications

Federated Learning for Human-in-the-Loop Many-to-Many Voice Conversion
12th ISCA Speech Synthesis Workshop (SSW2023) | Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-15
Authors: Ryunosuke Hirai, Yuki Saito, H. Saruwatari
Abstract: We propose a method for training a many-to-many voice conversion (VC) model that can additionally learn users' voices while protecting the privacy of their data. Conventional many-to-many VC methods train a VC model using a publicly available or proprietary multi-speaker corpus. However, they do not always achieve high-quality VC for input speech from various users. Our method is based on federated learning, a framework of distributed machine learning where a developer and users cooperatively train a machine learning model while protecting the privacy of user-owned data. We present a proof-of-concept method on the basis of StarGANv2-VC (i.e., Fed-StarGANv2-VC) and demonstrate that our method can achieve speaker similarity comparable to conventional non-federated StarGANv2-VC.
Citations: 0
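The federated setting described above hinges on server-side aggregation of client model updates. Below is a minimal sketch of FedAvg-style weighted parameter averaging, the generic aggregation step underlying federated learning; this is an illustration only, not the Fed-StarGANv2-VC training loop, and all numbers are toy values:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of per-client model parameters (FedAvg).

    client_weights: one list of np.ndarray layers per client.
    client_sizes: number of local training samples per client,
                  used to weight each client's contribution.
    """
    total = sum(client_sizes)
    merged = []
    for layer_group in zip(*client_weights):
        merged.append(sum(w * (n / total)
                          for w, n in zip(layer_group, client_sizes)))
    return merged

# Two toy "clients", each holding a one-layer model.
w_a = [np.array([1.0, 2.0])]
w_b = [np.array([3.0, 4.0])]
merged = fed_avg([w_a, w_b], client_sizes=[1, 3])
print(merged[0])  # client B contributes 3/4 of the average
```

The raw speech data never leaves the clients; only parameter updates are shared, which is what allows the users' voices to stay private.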
Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications
12th ISCA Speech Synthesis Workshop (SSW2023) | Pub Date: 2023-08-26 | DOI: 10.21437/ssw.2023-35
Authors: Biel Tura Vecino, Adam Gabrys, Daniel Matwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu, Jaime Lorenzo-Trueba
Abstract: Recent works have shown that modelling raw waveform directly from text in an end-to-end (E2E) fashion produces more natural-sounding speech than traditional neural text-to-speech (TTS) systems based on a cascade or two-stage approach. However, current E2E state-of-the-art models are computationally complex and memory-consuming, making them unsuitable for real-time offline on-device applications in low-resource scenarios. To address this issue, we propose a Lightweight E2E-TTS (LE2E) model that generates high-quality speech requiring minimal computational resources. We evaluate the proposed model on the LJSpeech dataset and show that it achieves state-of-the-art performance while being up to 90% smaller in terms of model parameters and 10× faster in real-time factor. Furthermore, we demonstrate that the proposed E2E training paradigm achieves better quality compared to an equivalent architecture trained in a two-stage approach. Our results suggest that LE2E is a promising approach for developing real-time, high-quality, low-resource TTS applications for on-device use.
Citations: 0
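The two efficiency metrics reported in the abstract can be made concrete. The sketch below shows how real-time factor and relative model-size reduction are conventionally computed; the specific parameter counts and timings are illustrative assumptions, not figures from the paper:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = wall-clock synthesis time / duration of generated audio.
    RTF < 1.0 means the model runs faster than real time."""
    return synthesis_seconds / audio_seconds

def size_reduction(params_baseline, params_small):
    """Relative parameter reduction: 0.9 corresponds to '90% smaller'."""
    return 1.0 - params_small / params_baseline

# Hypothetical numbers: a 13.4M-parameter baseline shrunk to 1.34M
# parameters is 90% smaller; synthesizing 10 s of audio in 0.5 s
# gives an RTF of 0.05.
print(size_reduction(13.4e6, 1.34e6))
print(real_time_factor(0.5, 10.0))
```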
MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module
12th ISCA Speech Synthesis Workshop (SSW2023) | Pub Date: 2023-01-17 | DOI: 10.21437/ssw.2023-8
Authors: Ondřej Plátek, Ondřej Dušek
Abstract: We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-finetuned SSL model when trained on only 136 utterances (ca. one minute of training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows the superiority of PLDA training over SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.
Citations: 1
Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling
12th ISCA Speech Synthesis Workshop (SSW2023) | Pub Date: 2022-12-20 | DOI: 10.21437/ssw.2023-23
Authors: T. Raitio, Javier Latorre, Andrea Davis, Tuuli H. Morrill, L. Golipour
Abstract: Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the overall TTS quality, 2) the proposed MSMS approach outperforms the pre-training and fine-tuning approach when utilizing additional multi-speaker data, and 3) long-form speaking style is highly rated regardless of the target text domain.
Citations: 0
Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion
12th ISCA Speech Synthesis Workshop (SSW2023) | Pub Date: 2022-10-18 | DOI: 10.21437/ssw.2023-10
Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, H. Saruwatari
Abstract: We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled-pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, we present a method for synthesizing spontaneous speech that improves robustness to diverse FP insertions. Regularization is used to stabilize the synthesis of the linguistic (i.e., non-FP) speech elements. To further improve robustness to diverse FP insertions, the method utilizes pseudo-FPs sampled using an FP word prediction model as well as ground-truth FPs. Our experiments demonstrated that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.
Citations: 0