Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective

IF 8.4; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia; Pub Date: 2024-06-05; DOI: 10.1109/TMM.2024.3410133

Ke Liu;Jiwei Wei;Jie Zou;Peng Wang;Yang Yang;Heng Tao Shen

Volume 26, pages 10623-10636. Full text: https://ieeexplore.ieee.org/document/10549860/

Citations: 0

Abstract

Multi-view speech emotion recognition (SER) based on pre-trained models has gained attention over the last two years and shows great potential for improving model performance in speaker-independent scenarios. However, existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, which increases complexity while yielding limited performance benefit. In this paper, we improve multi-view SER based on the pre-trained model from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning effective features hidden in a low-level speech feature, the mel-scale frequency cepstral coefficient (MFCC). We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and the learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves WA and UA of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves new state-of-the-art results.
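The abstract reports results as WA and UA without defining them. In the SER literature these usually denote weighted accuracy (overall accuracy across all utterances) and unweighted accuracy (the mean of per-class recalls, so that rare emotion classes count as much as frequent ones). The sketch below assumes those common definitions; the function names and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy labels invented for illustration (imbalanced on purpose).
y_true = ["angry", "angry", "angry", "angry", "happy", "happy", "sad", "sad"]
y_pred = ["angry", "angry", "angry", "angry", "happy", "sad", "sad", "angry"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

With imbalanced classes the two metrics diverge: here the majority class is always predicted correctly, which lifts WA above UA. This is why SER papers commonly report both.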


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.