Peiyun Xue , Xiang Gao , Jing Bai , Zhenan Dong , Zhiyu Wang , Jiangshuai Xu
{"title":"A dynamic-static feature fusion learning network for speech emotion recognition","authors":"Peiyun Xue , Xiang Gao , Jing Bai , Zhenan Dong , Zhiyu Wang , Jiangshuai Xu","doi":"10.1016/j.neucom.2025.129836","DOIUrl":null,"url":null,"abstract":"<div><div>Speech is a paramount mode of human communication, and enhancing the quality and fluency of Human-Computer Interaction (HCI) greatly benefits from the significant contribution of Speech Emotion Recognition (SER). Feature representation poses a persistent challenge in SER. A single feature is difficult to adequately represent speech emotion, while directly concatenating multiple features may overlook the complementary nature and introduce interference due to redundant information. Towards these difficulties, this paper proposes a Multi-feature Learning network based on Dynamic-Static feature Fusion (ML-DSF) to obtain an effective hybrid feature representation for SER. Firstly, a Time-Frequency domain Self-Calibration Module (TFSC) is proposed to help the traditional convolutional neural networks in extracting static image features from the Log-Mel spectrograms. Then, a Lightweight Temporal Convolutional Network (L-TCNet) is used to acquire multi-scale dynamic temporal causal knowledge from the Mel Frequency Cepstrum Coefficients (MFCC). At last, both extracted features groups are fed into a connection attention module, optimized by Principal Component Analysis (PCA), facilitating emotion classification by reducing redundant information and enhancing the complementary information between features. For ensuring the independence of feature extraction, this paper adopts the training separation strategy. Evaluating the proposed model on two public datasets yielded a Weighted Accuracy (WA) of 93.33 % and an Unweighted Accuracy (UA) of 93.12 % on the <em>RAVDESS</em> dataset, and 94.95 % WA and 94.56 % UA on the <em>EmoDB</em> dataset. The obtained results outperformed the State-Of-The-Art (SOTA) findings. Meanwhile, the effectiveness of each module is validated by ablation experiments, and the generalization analysis is carried out on the cross-corpus SER tasks.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"633 ","pages":"Article 129836"},"PeriodicalIF":5.5000,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225005089","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Speech is a paramount mode of human communication, and enhancing the quality and fluency of Human-Computer Interaction (HCI) greatly benefits from the significant contribution of Speech Emotion Recognition (SER). Feature representation poses a persistent challenge in SER. A single feature is difficult to adequately represent speech emotion, while directly concatenating multiple features may overlook the complementary nature and introduce interference due to redundant information. Towards these difficulties, this paper proposes a Multi-feature Learning network based on Dynamic-Static feature Fusion (ML-DSF) to obtain an effective hybrid feature representation for SER. Firstly, a Time-Frequency domain Self-Calibration Module (TFSC) is proposed to help the traditional convolutional neural networks in extracting static image features from the Log-Mel spectrograms. Then, a Lightweight Temporal Convolutional Network (L-TCNet) is used to acquire multi-scale dynamic temporal causal knowledge from the Mel Frequency Cepstrum Coefficients (MFCC). At last, both extracted features groups are fed into a connection attention module, optimized by Principal Component Analysis (PCA), facilitating emotion classification by reducing redundant information and enhancing the complementary information between features. For ensuring the independence of feature extraction, this paper adopts the training separation strategy. Evaluating the proposed model on two public datasets yielded a Weighted Accuracy (WA) of 93.33 % and an Unweighted Accuracy (UA) of 93.12 % on the RAVDESS dataset, and 94.95 % WA and 94.56 % UA on the EmoDB dataset. The obtained results outperformed the State-Of-The-Art (SOTA) findings. Meanwhile, the effectiveness of each module is validated by ablation experiments, and the generalization analysis is carried out on the cross-corpus SER tasks.
期刊介绍:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.