UnitDiff: A Unit-Diffusion Model for Code-Switching Speech Synthesis
Authors: Ke Chen; Zhihua Huang; Liang He; Yonghong Yan
Journal: IEEE Signal Processing Letters, vol. 32, pp. 1051-1055 (impact factor 3.2; JCR Q2, Engineering, Electrical & Electronic)
Publication date: 2025-02-18
DOI: 10.1109/LSP.2025.3543456
URL: https://ieeexplore.ieee.org/document/10891773/
Citation count: 0
Abstract
Because code-switching (CS) datasets are scarce, most researchers synthesize CS speech from multiple monolingual datasets. However, this approach makes it difficult to control speaker identity and yields speech with low intelligibility. This letter proposes UnitDiff, a CS speech synthesis model based on the unit-diffusion framework. The model uses "soft units", a self-supervised high-level representation extracted from soft HuBERT, to directly predict a clean mel-spectrogram $x_{0}$, which enhances control over speaker identity. A language tagging method is also introduced to improve speech intelligibility. Evaluation results validate the model's effectiveness in improving the intelligibility, speaker similarity, and speaker consistency of the generated CS speech.
About the journal:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.