{"title":"SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis","authors":"Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu","doi":"arxiv-2409.07556","DOIUrl":null,"url":null,"abstract":"In this paper, we introduce SSR-Speech, a neural codec autoregressive model\ndesigned for stable, safe, and robust zero-shot text-based speech editing and\ntext-to-speech synthesis. SSR-Speech is built on a Transformer decoder and\nincorporates classifier-free guidance to enhance the stability of the\ngeneration process. A watermark Encodec is proposed to embed frame-level\nwatermarks into the edited regions of the speech so that which parts were\nedited can be detected. In addition, the waveform reconstruction leverages the\noriginal unedited speech segments, providing superior recovery compared to the\nEncodec model. Our approach achieves the state-of-the-art performance in the\nRealEdit speech editing task and the LibriTTS text-to-speech task, surpassing\nprevious methods. Furthermore, SSR-Speech excels in multi-span speech editing\nand also demonstrates remarkable robustness to background sounds. Source code\nand demos are released.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07556","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited regions of the speech so that the edited parts can be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing superior recovery compared to the Encodec model. Our approach achieves state-of-the-art performance on the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. Furthermore, SSR-Speech excels at multi-span speech editing and demonstrates remarkable robustness to background sounds. Source code and demos are released.
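
The abstract notes that classifier-free guidance is used to stabilize the autoregressive generation. The sketch below illustrates, under stated assumptions, how classifier-free guidance is commonly applied at decoding time for a codec-token language model: the `model` interface, the null-conditioning strategy, and the `guidance_scale` value are illustrative placeholders, not the paper's actual implementation.

```python
# Minimal sketch of classifier-free guidance (CFG) for one autoregressive
# decoding step over codec tokens. The model call signature is hypothetical.
import torch

@torch.no_grad()
def cfg_decode_step(model, tokens, cond, guidance_scale=1.5):
    """Sample the next codec token by mixing conditional and unconditional logits."""
    # Conditional pass: text/prompt conditioning attached (assumed interface).
    logits_cond = model(tokens, condition=cond)[:, -1, :]
    # Unconditional pass: conditioning dropped, e.g. replaced by a null embedding.
    logits_uncond = model(tokens, condition=None)[:, -1, :]
    # CFG: extrapolate away from the unconditional prediction toward the conditional one.
    logits = logits_uncond + guidance_scale * (logits_cond - logits_uncond)
    return torch.distributions.Categorical(logits=logits).sample()
```

A guidance scale of 1.0 reduces to ordinary conditional sampling; larger values trade diversity for adherence to the conditioning, which is the usual mechanism by which CFG stabilizes generation.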