A speech emotion recognition method based on DST-GDCC and text-to-speech data augmentation
Ning Li, Junjie Hou, Wenjiao Zhang, Yanan Zhuang, Qianqian Xu, Haohan Yong
Digital Signal Processing, Volume 168, Article 105636. Published 2025-10-01. DOI: 10.1016/j.dsp.2025.105636
Available at: https://www.sciencedirect.com/science/article/pii/S105120042500658X
Citations: 0
Abstract
Speech Emotion Recognition (SER) is a critical component of human-machine interaction, yet it confronts two fundamental challenges: limited feature extraction capabilities and data scarcity. This paper proposes a unified framework that synergistically addresses both issues through the co-design of a novel SER model and a high-quality data augmentation strategy. At its core, the Deformable Speech Transformer (DST) and the Gated Dilation Causal Convolution (GDCC) are introduced, which are combined to form the DST-GDCC model for superior feature extraction. The DST component adaptively captures multi-granular acoustic features, while the GDCC module explicitly models the spatiotemporal causality of speech emotions. However, the full potential of such an advanced model is often constrained by scarce training data. To overcome this limitation, a Text-to-Speech (TTS) data augmentation method is incorporated, leveraging a pre-trained GPT-SoVITS model to synthesize high-fidelity, emotion-conditioned speech samples. Crucially, these two components form a virtuous cycle: the powerful discriminative ability of the DST-GDCC model is leveraged in a dual-stage screening mechanism to ensure the quality of the synthetic data, while the expanded, high-quality dataset, in turn, enables the model to realize its full potential. Experimental results demonstrate the framework's effectiveness. The DST-GDCC model itself achieves significant accuracy improvements over baselines (2.66% on IEMOCAP, 5.02% on MELD, 5.83% on CASIA). More importantly, the synergistic integration with TTS data augmentation yields further gains of 3.13% on IEMOCAP and 3.33% on CASIA, validating the framework's capability to systematically elevate SER performance.
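The abstract names the GDCC module but does not spell out its internals. Gated dilated causal convolutions are, however, a well-established construct (popularized by WaveNet), so the following PyTorch sketch illustrates only the general pattern; the channel counts, tanh/sigmoid gating, and residual wiring here are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDilatedCausalConv(nn.Module):
    """Illustrative gated dilated causal convolution block (WaveNet-style).

    A generic sketch, not the paper's exact GDCC module: layer sizes,
    gating form, and residual wiring are assumptions.
    """

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        # Left-only padding keeps the convolution causal: the output at
        # frame t depends only on frames <= t.
        self.left_pad = (kernel_size - 1) * dilation
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        padded = F.pad(x, (self.left_pad, 0))      # pad the past, not the future
        f = torch.tanh(self.filter_conv(padded))   # filter branch
        g = torch.sigmoid(self.gate_conv(padded))  # gate branch in [0, 1]
        return x + f * g                           # gated output with residual connection


if __name__ == "__main__":
    block = GatedDilatedCausalConv(channels=64, kernel_size=3, dilation=2)
    feats = torch.randn(8, 64, 200)  # (batch, channels, frames), e.g. log-mel features
    print(block(feats).shape)        # torch.Size([8, 64, 200]) -- length preserved
```

Similarly, the dual-stage screening of synthetic samples is only named in the abstract. One plausible reading is a label-consistency check followed by a confidence threshold, both applied with the trained classifier; the helper below, including the `threshold` parameter and the two stages as written, is a hypothetical sketch of that reading rather than the published mechanism.

```python
import torch

@torch.no_grad()
def screen_synthetic_samples(model, samples, target_labels, threshold=0.9):
    """Hypothetical dual-stage filter for emotion-conditioned TTS samples.

    Stage 1: keep a sample only if the classifier's predicted emotion
             matches the emotion the TTS system was conditioned on.
    Stage 2: additionally require the predicted probability to exceed
             a confidence threshold. Both stages are assumptions, not
             the paper's published details.
    """
    model.eval()
    kept = []
    for x, y in zip(samples, target_labels):
        probs = torch.softmax(model(x.unsqueeze(0)), dim=-1).squeeze(0)
        pred = int(probs.argmax())
        if pred == y and probs[pred] >= threshold:  # stages 1 and 2
            kept.append((x, y))
    return kept
```

In a pipeline like the one the abstract describes, samples surviving both stages would be merged into the training set for a further training round, closing the model-to-data "virtuous cycle".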
About the journal
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology, such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• cheminformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy