Mixture of Emotion Dependent Experts: Facial Expressions Recognition in Videos Through Stacked Expert Models

IF 2.9 | Q2 | Engineering, Electrical & Electronic
Ali N. Salman, Karen Rosero, Lucas Goncalves, Carlos Busso
{"title":"Mixture of Emotion Dependent Experts: Facial Expressions Recognition in Videos Through Stacked Expert Models","authors":"Ali N. Salman;Karen Rosero;Lucas Goncalves;Carlos Busso","doi":"10.1109/OJSP.2025.3530793","DOIUrl":null,"url":null,"abstract":"Recent advancements in <italic>dynamic facial expression recognition</i> (DFER) have predominantly utilized static features, which are theoretically inferior to dynamic features. However, models fully trained with dynamic features often suffer from over-fitting due to the limited size and diversity of the training data for fully <italic>supervised learning</i> (SL) models. A significant challenge with existing models based on static features in recognizing emotions from videos is their tendency to form biased representations, often unbalanced or skewed towards more prevalent or basic emotional features present in the static domain, especially with posed expression. Therefore, this approach under-represents the nuances present in the dynamic domain. To address this issue, our study introduces a novel approach that we refer to as <italic>mixture of emotion-dependent experts</i> (MoEDE). This strategy relies on emotion-specific feature extractors to produce more diverse emotional static features to train DFER systems. Each emotion-dependent expert focuses exclusively on one emotional category, formulating the problem as binary classifiers. Our DFER model combines these static representations with recurrent models, modeling their temporal relationships. The proposed MoEDE DFER approach achieves a macro F1-score of 74.5%, marking a significant improvement over the baseline, which presented a macro F1-score of 70.9% . The DFER baseline is similar to MoEDE, but it uses a single static feature extractor rather than stacked extractors. Additionally, our proposed approach shows consistent improvements compared to other four popular baselines.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"323-332"},"PeriodicalIF":2.9000,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843404","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of signal processing","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10843404/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Recent advancements in dynamic facial expression recognition (DFER) have predominantly utilized static features, which are theoretically inferior to dynamic features. However, models fully trained with dynamic features often suffer from overfitting due to the limited size and diversity of the training data available for fully supervised learning (SL) models. A significant challenge for existing static-feature models in recognizing emotions from videos is their tendency to form biased representations, often unbalanced or skewed towards the more prevalent or basic emotional features present in the static domain, especially with posed expressions. As a result, this approach under-represents the nuances present in the dynamic domain. To address this issue, our study introduces a novel approach that we refer to as mixture of emotion-dependent experts (MoEDE). This strategy relies on emotion-specific feature extractors to produce more diverse static emotional features for training DFER systems. Each emotion-dependent expert focuses exclusively on one emotional category, formulating the problem as a set of binary classifiers. Our DFER model combines these static representations with recurrent models to capture their temporal relationships. The proposed MoEDE DFER approach achieves a macro F1-score of 74.5%, a significant improvement over the baseline, which obtained a macro F1-score of 70.9%. The DFER baseline is similar to MoEDE but uses a single static feature extractor rather than stacked extractors. Additionally, our proposed approach shows consistent improvements over four other popular baselines.
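The abstract does not specify architectural details, so the following is a minimal PyTorch sketch of the MoEDE idea under stated assumptions: a small convolutional backbone stands in for the unspecified static feature extractor, a hypothetical six-emotion label set, and a GRU as the recurrent temporal model. The class names (EmotionExpert, MoEDE), layer sizes, and the feature-stacking-by-concatenation step are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical emotion label set; the paper's categories may differ.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

class EmotionExpert(nn.Module):
    """One emotion-dependent expert: a frame-level static feature extractor
    trained as a one-vs-rest binary classifier for a single emotion."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Placeholder backbone; the abstract does not name the architecture.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Binary head used only while pre-training this expert on its emotion.
        self.binary_head = nn.Linear(feat_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch * time, 3, H, W) -> (batch * time, feat_dim)
        return self.backbone(frames)

class MoEDE(nn.Module):
    """Stack the per-emotion static features and model their temporal
    evolution with a recurrent network for clip-level classification."""
    def __init__(self, num_emotions: int = len(EMOTIONS),
                 feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.experts = nn.ModuleList(
            [EmotionExpert(feat_dim) for _ in range(num_emotions)])
        self.rnn = nn.GRU(num_emotions * feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)  # (b*t, 3, H, W)
        # Concatenate every expert's static features per frame.
        feats = torch.cat([e(frames) for e in self.experts], dim=-1)
        feats = feats.view(b, t, -1)          # (b, t, num_emotions * feat_dim)
        _, h = self.rnn(feats)                # final hidden state summarizes the clip
        return self.classifier(h[-1])         # (b, num_emotions) logits

# Usage: two clips of 16 frames at 112x112 resolution.
logits = MoEDE()(torch.randn(2, 16, 3, 112, 112))
```

In this reading, each expert is first pre-trained with its binary head, then frozen or fine-tuned as a feature extractor; only the concatenated features feed the recurrent classifier. How the authors actually combine or weight the experts is not specified in the abstract.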