Improving speech emotion recognition by fusing self-supervised learning and spectral features via mixture of experts

IF 2.7 Zone 3 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data & Knowledge Engineering Pub Date : 2023-12-13 DOI:10.1016/j.datak.2023.102262

Jonghwan Hyeon, Yung-Hwan Oh, Young-Jun Lee, Ho-Jin Choi

Volume 150, Article 102262

https://www.sciencedirect.com/science/article/pii/S0169023X23001222

Citations: 0

Abstract

Speech Emotion Recognition (SER) is an important area of speech processing research that aims to identify and classify the emotional states conveyed through speech signals. Recent studies have achieved considerable performance in SER by exploiting deep contextualized speech representations from self-supervised learning (SSL) models. However, SSL models pre-trained on clean speech data may not perform well on emotional speech data because of the domain shift problem. To address this problem, this paper proposes a novel approach that simultaneously exploits an SSL model and a domain-agnostic spectral feature (SF) through the Mixture of Experts (MoE) technique. The proposed approach achieves state-of-the-art weighted accuracy on the IEMOCAP dataset compared to other methods. Moreover, this paper demonstrates that SSL models suffer from the domain shift problem in the SER task.
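The fusion idea in the abstract can be illustrated with a minimal sketch of a soft-gated mixture of two experts. This is not the authors' architecture, which the abstract does not specify. It assumes one linear expert per feature stream, a linear gate over the concatenated features, four emotion classes, and randomly initialized, untrained weights. All names and dimensions (`moe_fuse`, `SSL_DIM`, `SPEC_DIM`, and so on) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Apply a dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, n_out, n_in):
    """Small random weights and zero bias (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def moe_fuse(ssl_feat, spec_feat, ssl_expert, spec_expert, gate):
    """Mix the class probabilities of two experts with a learned soft gate."""
    # Each expert sees only its own feature stream.
    p_ssl = softmax(linear(*ssl_expert, ssl_feat))
    p_spec = softmax(linear(*spec_expert, spec_feat))
    # The gate sees both streams and emits one weight per expert.
    g = softmax(linear(*gate, ssl_feat + spec_feat))
    fused = [g[0] * a + g[1] * b for a, b in zip(p_ssl, p_spec)]
    return fused, g


# Hypothetical sizes: 4 emotion classes, 8-dim SSL embedding, 6-dim spectral feature.
rng = random.Random(0)
N_CLASSES, SSL_DIM, SPEC_DIM = 4, 8, 6
ssl_expert = init_linear(rng, N_CLASSES, SSL_DIM)
spec_expert = init_linear(rng, N_CLASSES, SPEC_DIM)
gate = init_linear(rng, 2, SSL_DIM + SPEC_DIM)

# Stand-in feature vectors; real ones would come from an SSL model and a spectrogram.
ssl_feat = [rng.gauss(0, 1) for _ in range(SSL_DIM)]
spec_feat = [rng.gauss(0, 1) for _ in range(SPEC_DIM)]

fused, gate_weights = moe_fuse(ssl_feat, spec_feat, ssl_expert, spec_expert, gate)
print("gate weights:", gate_weights)
print("fused class probabilities:", fused)
```

Because the gate weights sum to one, the fused output is a convex combination of the two experts' distributions and is itself a valid probability distribution. In a trained system, the gate could learn to lean on the spectral expert when the SSL representation is unreliable for a given utterance, which is one plausible reading of how such a fusion might mitigate domain shift.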




Source journal

Data & Knowledge Engineering (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore

5.00

Self-citation rate

0.00%

Articles published

Review time

6 months

Journal description: Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between the two related fields of data engineering and knowledge engineering. DKE reaches a worldwide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of data and knowledge systems.