12th ISCA Speech Synthesis Workshop (SSW2023)最新文献

Audiobook synthesis with long-form neural text-to-speech 语音读物合成与长形式的神经文本到语音

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-22

Weicheng Zhang, Cheng-chieh Yeh, Will Beckman, T. Raitio, Ramya Rasipuram, L. Golipour, David Winarsky

{"title":"Audiobook synthesis with long-form neural text-to-speech","authors":"Weicheng Zhang, Cheng-chieh Yeh, Will Beckman, T. Raitio, Ramya Rasipuram, L. Golipour, David Winarsky","doi":"10.21437/ssw.2023-22","DOIUrl":"https://doi.org/10.21437/ssw.2023-22","url":null,"abstract":"Despite recent advances in text-to-speech (TTS) technology, auto-narration of long-form content such as books remains a challenge. The goal of this work is to enhance neural TTS to be suitable for long-form content such as audiobooks. In addition to high quality, we aim to provide a compelling and engaging listening experience with expressivity that spans beyond a single sentence to a paragraph level so that the user can not only follow the story but also enjoy listening to it. Towards that goal, we made four enhancements to our baseline TTS system: incorporation of BERT embeddings, explicit prosody prediction from text, long-context modeling over multiple sentences, and pre-training on long-form data. We propose an evaluation framework tailored to long-form content that evaluates the synthesis on segments spanning multiple paragraphs and focuses on elements such as comprehension, ease of listening, ability to keep attention, and enjoyment. The evaluation results show that the proposed approach outperforms the baseline on all evaluated metrics, with an absolute 0.47 MOS gain in overall quality. Ablation studies further confirm the effectiveness of the proposed enhancements.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122861468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping 基于掩模卷积网络和开环时间扭曲的语音持续时间自适应修正

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-28

Ravi Shankar, Archana Venkataraman

引用次数: 0

Better Replacement for TTS Naturalness Evaluation 更好地替代TTS自然度评价

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-31

S. Shirali-Shahreza, Gerald Penn

引用次数: 0

Advocating for text input in multi-speaker text-to-speech systems 提倡在多扬声器文本到语音系统中进行文本输入

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-1

G. Bailly, Martin Lenglet, O. Perrotin, E. Klabbers

引用次数: 0

SPTK4: An Open-Source Software Toolkit for Speech Signal Processing SPTK4:一个用于语音信号处理的开源软件工具包

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-33

Takenori Yoshimura, Takato Fujimoto, Keiichiro Oura, K. Tokuda

引用次数: 0

Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations Spell4TTS:声学信息拼写，提高文本到语音的发音

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-2

Jason Fong, Hao Tang, Simon King

{"title":"Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations","authors":"Jason Fong, Hao Tang, Simon King","doi":"10.21437/ssw.2023-2","DOIUrl":"https://doi.org/10.21437/ssw.2023-2","url":null,"abstract":"Ensuring accurate pronunciation is critical for high-quality text-to-speech (TTS). This typically requires a phoneme-based pro-nunciation dictionary, which is labour-intensive and costly to create. Previous work has suggested using graphemes instead of phonemes, but the inevitable pronunciation errors that occur cannot be fixed, since there is no longer a pronunciation dictionary. As an alternative, speech-based self-supervised learning (SSL) models have been proposed for pronunciation control, but these models are computationally expensive to train, produce representations that are not easily interpretable, and capture unwanted non-phonemic information. To address these limitations, we propose Spell4TTS, a novel method that generates acoustically-informed word spellings. Spellings are both inter-pretable and easily edited. The method could be applied to any existing pre-built TTS system. Our experiments show that the method creates word spellings that lead to fewer TTS pronunciation errors than the original spellings, or an Automatic Speech Recognition baseline. Additionally, we observe that pronunciation can be further enhanced by ranking candidates in the space of SSL speech representations, and by incorporating Human-in-the-Loop screening over the top-ranked spellings devised by our method. By working with spellings of words (composed of characters), the method lowers the entry barrier for TTS sys-tem development for languages with limited pronunciation resources. It should reduce the time and cost involved in creating and maintaining pronunciation dictionaries.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132536684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

PRVAE-VC: Non-Parallel Many-to-Many Voice Conversion with Perturbation-Resistant Variational Autoencoder 基于抗扰动变分自编码器的非并行多对多语音转换

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-14

Kou Tanaka, H. Kameoka, Takuhiro Kaneko

{"title":"PRVAE-VC: Non-Parallel Many-to-Many Voice Conversion with Perturbation-Resistant Variational Autoencoder","authors":"Kou Tanaka, H. Kameoka, Takuhiro Kaneko","doi":"10.21437/ssw.2023-14","DOIUrl":"https://doi.org/10.21437/ssw.2023-14","url":null,"abstract":"This paper describes a novel approach to non-parallel many-to-many voice conversion (VC) that utilizes a variant of the conditional variational autoencoder (VAE) called a perturbation-resistant VAE (PRVAE). In VAE-based VC, it is commonly assumed that the encoder extracts content from the input speech while removing source speaker information. Following this extraction, the decoder generates output from the extracted content and target speaker information. However, in practice, the encoded features may still retain source speaker information, which can lead to a degradation of speech quality during speaker conversion tasks. To address this issue, we propose a perturbation-resistant encoder trained to match the encoded features of the input speech with those of a pseudo-speech generated through a content-preserving transformation of the input speech’s fundamental frequency and spectral envelope using a combination of pure signal processing techniques. Our experimental results demonstrate that this straightforward constraint signiﬁcantly enhances the performance in non-parallel many-to-many speaker conversion tasks. Audio samples can be accessed at our webpage 1 .","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128169759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation 陷入MOS陷阱:TTS评估中MOS测试方法的批判性分析

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-7

Ambika Kirkland, Shivam Mehta, Harm Lameris, G. Henter, Éva Székely, Joakim Gustafson

引用次数: 3

Synthesising turn-taking cues using natural conversational data 使用自然会话数据合成轮转提示

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-12

Johannah O'Mahony, Catherine Lai, Simon King

引用次数: 0

Re-examining the quality dimensions of synthetic speech 重新审视合成语音的质量维度

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI: 10.21437/ssw.2023-6

Fritz Seebauer, Michael Kuhlmann, Reinhold Haeb-Umbach, P. Wagner

引用次数: 0