DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech

Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li

arXiv:2409.11835 (arXiv - EE - Audio and Speech Processing), published 2024-09-18. https://doi.org/arxiv-2409.11835
Abstract
In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high-frequency, frame-by-frame progressive inference approach that aligns more closely with the acoustic properties of speech, enhancing the naturalness of the generated audio. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method nearly doubles training speed and significantly outperforms the baseline models.
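
The abstract does not spell out how directional patch interaction is implemented. One plausible reading is an attention mask over spectrogram patches in which each patch may attend to patches in earlier frames and to same-or-lower frequency bands within its own frame, matching the stated low-to-high-frequency, frame-by-frame progression. The PyTorch sketch below illustrates that idea only; the function name `directional_patch_mask`, the frame-major patch ordering, and the exact masking rule are assumptions for illustration, not the paper's implementation.

```python
import torch


def directional_patch_mask(n_frames: int, n_bands: int) -> torch.Tensor:
    """Boolean attention mask over mel-spectrogram patches (illustrative).

    Patches are ordered frame-major: index = frame * n_bands + band.
    mask[q, k] is True when query patch q may attend to key patch k:
    any patch in an earlier frame, or a same-or-lower frequency band
    within the same frame (low-to-high, frame-by-frame progression).
    """
    n = n_frames * n_bands
    idx = torch.arange(n)
    frame = idx // n_bands  # frame index of each patch
    band = idx % n_bands    # frequency-band index of each patch

    # Query (rows) vs. key (columns) comparisons via broadcasting.
    earlier_frame = frame.unsqueeze(1) > frame.unsqueeze(0)
    same_frame_lower_band = (frame.unsqueeze(1) == frame.unsqueeze(0)) & (
        band.unsqueeze(1) >= band.unsqueeze(0)
    )
    return earlier_frame | same_frame_lower_band


# Example: 4 frames x 3 frequency bands -> a 12x12 mask.
mask = directional_patch_mask(4, 3)
print(mask.int())
```

A mask like this could be passed as the boolean `attn_mask` argument of `torch.nn.functional.scaled_dot_product_attention` (where True marks allowed attention), restricting each patch's receptive field to the direction described in the abstract.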