Diffusion Transformer for Adaptive Text-to-Speech

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI:10.21437/ssw.2023-25

Haolin Chen, Philip N. Garner

Citations: 1

Abstract

Given the success of diffusion in synthesizing realistic speech, we investigate how diffusion can be included in adaptive text-to-speech systems. Inspired by the adaptable layer norm modules for Transformer, we adapt a new backbone of diffusion models, Diffusion Transformer, for acoustic modeling. Specifically, the adaptive layer norm in the architecture is used to condition the diffusion network on text representations, which further enables parameter-efficient adaptation. We show the new architecture to be a faster alternative to its convolutional counterpart for general text-to-speech, while demonstrating a clear advantage on naturalness and similarity over the Transformer for few-shot and few-parameter adaptation. In the zero-shot scenario, while the new backbone is a decent alternative, the main benefit of such an architecture is to enable high-quality parameter-efficient adaptation when finetuning is performed.
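The conditioning mechanism named in the abstract, adaptive layer norm, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `AdaptiveLayerNorm`, the helper `layer_norm`, the zero initialization, and the `(1 + scale)` parameterization are assumptions made for illustration. The idea shown is that the scale and shift applied after normalization are predicted from a conditioning vector (for example, a text representation) rather than stored as fixed learned parameters.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaptiveLayerNorm:
    """Hypothetical adaptive layer norm: scale and shift come from a condition."""

    def __init__(self, hidden_dim, cond_dim):
        # Assumption: zero initialization, so the module starts out as a
        # plain layer norm and only departs from it once these are trained.
        self.weight = np.zeros((cond_dim, 2 * hidden_dim))
        self.bias = np.zeros(2 * hidden_dim)

    def __call__(self, x, cond):
        # x: (frames, hidden_dim), cond: (frames, cond_dim)
        projected = cond @ self.weight + self.bias
        scale, shift = np.split(projected, 2, axis=-1)
        return layer_norm(x) * (1.0 + scale) + shift

    def num_adaptable_params(self):
        # The only parameters this module would expose for finetuning.
        return self.weight.size + self.bias.size


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))     # 4 frames, hidden size 8
cond = rng.normal(size=(4, 3))  # 4 conditioning vectors of size 3
adaln = AdaptiveLayerNorm(hidden_dim=8, cond_dim=3)
y = adaln(x, cond)
print(y.shape, adaln.num_adaptable_params())
```

In a sketch like this, parameter-efficient adaptation would amount to updating only `weight` and `bias` of such modules while the rest of the network stays frozen; which parameters the paper actually finetunes is not stated in the abstract.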
