{"title":"Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization","authors":"Xiaoxue Gao, Chen Zhang, Yiming Chen, Huayun Zhang, Nancy F. Chen","doi":"arxiv-2409.10157","DOIUrl":null,"url":null,"abstract":"Current emotional text-to-speech (TTS) models predominantly conduct\nsupervised training to learn the conversion from text and desired emotion to\nits emotional speech, focusing on a single emotion per text-speech pair. These\nmodels only learn the correct emotional outputs without fully comprehending\nother emotion characteristics, which limits their capabilities of capturing the\nnuances between different emotions. We propose a controllable Emo-DPO approach,\nwhich employs direct preference optimization to differentiate subtle emotional\nnuances between emotions through optimizing towards preferred emotions over\nless preferred emotional ones. Instead of relying on traditional neural\narchitectures used in existing emotional TTS models, we propose utilizing the\nemotion-aware LLM-TTS neural architecture to leverage LLMs' in-context learning\nand instruction-following capabilities. Comprehensive experiments confirm that\nour proposed method outperforms the existing baselines.","PeriodicalId":501034,"journal":{"name":"arXiv - EE - Signal Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Signal Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10157","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Current emotional text-to-speech (TTS) models predominantly rely on supervised training to learn the mapping from text and a desired emotion to the corresponding emotional speech, focusing on a single emotion per text-speech pair. These models learn only the correct emotional output without fully comprehending the characteristics of other emotions, which limits their ability to capture the nuances between different emotions. We propose a controllable Emo-DPO approach, which employs direct preference optimization to differentiate subtle emotional nuances by optimizing towards preferred emotions over less preferred ones. Instead of relying on the traditional neural architectures used in existing emotional TTS models, we propose an emotion-aware LLM-TTS neural architecture that leverages LLMs' in-context learning and instruction-following capabilities. Comprehensive experiments confirm that our proposed method outperforms existing baselines.
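For context, the standard direct preference optimization objective (Rafailov et al., 2023) that this approach builds on can be sketched as follows. The notation here is ours, not necessarily the paper's: $x$ is the text prompt with the target-emotion instruction, $y_w$ is a speech-token sequence realizing the preferred emotion, and $y_l$ is a less preferred emotional rendering. The paper's exact formulation may differ in detail.

\[
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
\]

Here $\pi_\theta$ is the LLM-TTS policy being fine-tuned, $\pi_{\text{ref}}$ is a frozen reference copy, $\beta$ controls how strongly the policy is regularized towards the reference, and $\sigma$ is the logistic function. Minimizing this loss raises the likelihood of the preferred emotional rendering relative to the less preferred one, which matches the abstract's framing of contrasting emotions rather than fitting a single correct output per pair.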