StreamMel：实时零镜头文本到语音通过交错连续自回归建模

IF 3.9 2区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Signal Processing Letters Pub Date : 2025-08-19 DOI:10.1109/LSP.2025.3600376

Hui Wang;Yifan Yang;Shujie Liu;Jinyu Li;Lingwei Meng;Yanqing Liu;Jiaming Zhou;Haoqin Sun;Yan Lu;Yong Qin

{"title":"StreamMel：实时零镜头文本到语音通过交错连续自回归建模","authors":"Hui Wang;Yifan Yang;Shujie Liu;Jinyu Li;Lingwei Meng;Yanqing Liu;Jiaming Zhou;Haoqin Sun;Yan Lu;Yong Qin","doi":"10.1109/LSP.2025.3600376","DOIUrl":null,"url":null,"abstract":"Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generationfor unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations,leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3530-3534"},"PeriodicalIF":3.9000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"StreamMel: Real-Time Zero-Shot Text-to-Speech Via Interleaved Continuous Autoregressive Modeling\",\"authors\":\"Hui Wang;Yifan Yang;Shujie Liu;Jinyu Li;Lingwei Meng;Yanqing Liu;Jiaming Zhou;Haoqin Sun;Yan Lu;Yong Qin\",\"doi\":\"10.1109/LSP.2025.3600376\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generationfor unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations,leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models.\",\"PeriodicalId\":13154,\"journal\":{\"name\":\"IEEE Signal Processing Letters\",\"volume\":\"32 \",\"pages\":\"3530-3534\"},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2025-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Signal Processing Letters\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11129630/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11129630/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

摘要

最近在零镜头文本到语音（TTS）合成方面的进展已经为看不见的说话者实现了高质量的语音生成，但大多数系统由于其离线设计而不适合实时应用。当前的流TTS范式通常依赖于多级管道和离散表示，导致计算成本增加和系统性能次优。在这项工作中，我们提出了StreamMel，这是一个开创性的单级流TTS框架，可以模拟连续mel谱图。通过将文本标记与声学帧交错，StreamMel实现低延迟，自回归合成，同时保持高扬声器相似性和自然性。在librisspeech上的实验表明，StreamMel在质量和延迟方面都优于现有的流TTS基线。它在支持高效实时生成的同时，甚至实现了与离线系统相当的性能，展示了与实时语音大型语言模型集成的广阔前景。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

StreamMel: Real-Time Zero-Shot Text-to-Speech Via Interleaved Continuous Autoregressive Modeling

Recent advances in zero-shot text-to-speech (TTS) synthesis have achieved high-quality speech generationfor unseen speakers, but most systems remain unsuitable for real-time applications because of their offline design. Current streaming TTS paradigms often rely on multi-stage pipelines and discrete representations,leading to increased computational cost and suboptimal system performance. In this work, we propose StreamMel, a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. By interleaving text tokens with acoustic frames, StreamMel enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. Experiments on LibriSpeech demonstrate that StreamMel outperforms existing streaming TTS baselines in both quality and latency. It even achieves performance comparable to offline systems while supporting efficient real-time generation, showcasing broad prospects for integration with real-time speech large language models.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Signal Processing Letters 工程技术-工程：电子与电气

CiteScore

7.40

自引率

12.80%

发文量

339

审稿时长

2.8 months

期刊介绍： The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshop organized by the Signal Processing Society.