Mixture of Emotion Dependent Experts: Facial Expressions Recognition in Videos Through Stacked Expert Models

IF 2.9 | Q2 | Engineering, Electrical & Electronic
Ali N. Salman, Karen Rosero, Lucas Goncalves, Carlos Busso
{"title":"Mixture of Emotion Dependent Experts: Facial Expressions Recognition in Videos Through Stacked Expert Models","authors":"Ali N. Salman;Karen Rosero;Lucas Goncalves;Carlos Busso","doi":"10.1109/OJSP.2025.3530793","DOIUrl":null,"url":null,"abstract":"Recent advancements in <italic>dynamic facial expression recognition</i> (DFER) have predominantly utilized static features, which are theoretically inferior to dynamic features. However, models fully trained with dynamic features often suffer from over-fitting due to the limited size and diversity of the training data for fully <italic>supervised learning</i> (SL) models. A significant challenge with existing models based on static features in recognizing emotions from videos is their tendency to form biased representations, often unbalanced or skewed towards more prevalent or basic emotional features present in the static domain, especially with posed expression. Therefore, this approach under-represents the nuances present in the dynamic domain. To address this issue, our study introduces a novel approach that we refer to as <italic>mixture of emotion-dependent experts</i> (MoEDE). This strategy relies on emotion-specific feature extractors to produce more diverse emotional static features to train DFER systems. Each emotion-dependent expert focuses exclusively on one emotional category, formulating the problem as binary classifiers. Our DFER model combines these static representations with recurrent models, modeling their temporal relationships. The proposed MoEDE DFER approach achieves a macro F1-score of 74.5%, marking a significant improvement over the baseline, which presented a macro F1-score of 70.9% . The DFER baseline is similar to MoEDE, but it uses a single static feature extractor rather than stacked extractors. Additionally, our proposed approach shows consistent improvements compared to other four popular baselines.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"323-332"},"PeriodicalIF":2.9000,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843404","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of signal processing","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10843404/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Recent advancements in dynamic facial expression recognition (DFER) have predominantly utilized static features, which are theoretically inferior to dynamic features. However, models fully trained with dynamic features often suffer from overfitting due to the limited size and diversity of the training data available for fully supervised learning (SL) models. A significant challenge for existing static-feature models in recognizing emotions from videos is their tendency to form biased representations, often unbalanced or skewed towards the more prevalent or basic emotional features present in the static domain, especially with posed expressions. As a result, this approach under-represents the nuances present in the dynamic domain. To address this issue, our study introduces a novel approach that we refer to as mixture of emotion-dependent experts (MoEDE). This strategy relies on emotion-specific feature extractors to produce more diverse static emotional features for training DFER systems. Each emotion-dependent expert focuses exclusively on one emotional category, formulating the problem as a set of binary classifiers. Our DFER model combines these static representations with recurrent models to capture their temporal relationships. The proposed MoEDE DFER approach achieves a macro F1-score of 74.5%, a significant improvement over the baseline, which obtained a macro F1-score of 70.9%. The DFER baseline is similar to MoEDE but uses a single static feature extractor rather than stacked extractors. Additionally, our proposed approach shows consistent improvements over four other popular baselines.
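The abstract does not specify architectural details, so the following is a minimal PyTorch sketch of the MoEDE idea under stated assumptions: a small convolutional backbone stands in for the unspecified static feature extractor, a hypothetical six-emotion label set, and a GRU as the recurrent temporal model. The class names (EmotionExpert, MoEDE), layer sizes, and the feature-stacking-by-concatenation step are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical emotion label set; the paper's categories may differ.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

class EmotionExpert(nn.Module):
    """One emotion-dependent expert: a frame-level static feature extractor
    trained as a one-vs-rest binary classifier for a single emotion."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Placeholder backbone; the abstract does not name the architecture.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Binary head used only while pre-training this expert on its emotion.
        self.binary_head = nn.Linear(feat_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch * time, 3, H, W) -> (batch * time, feat_dim)
        return self.backbone(frames)

class MoEDE(nn.Module):
    """Stack the per-emotion static features and model their temporal
    evolution with a recurrent network for clip-level classification."""
    def __init__(self, num_emotions: int = len(EMOTIONS),
                 feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.experts = nn.ModuleList(
            [EmotionExpert(feat_dim) for _ in range(num_emotions)])
        self.rnn = nn.GRU(num_emotions * feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)  # (b*t, 3, H, W)
        # Concatenate every expert's static features per frame.
        feats = torch.cat([e(frames) for e in self.experts], dim=-1)
        feats = feats.view(b, t, -1)          # (b, t, num_emotions * feat_dim)
        _, h = self.rnn(feats)                # final hidden state summarizes the clip
        return self.classifier(h[-1])         # (b, num_emotions) logits

# Usage: two clips of 16 frames at 112x112 resolution.
logits = MoEDE()(torch.randn(2, 16, 3, 112, 112))
```

In this reading, each expert is first pre-trained with its binary head, then frozen or fine-tuned as a feature extractor; only the concatenated features feed the recurrent classifier. How the authors actually combine or weight the experts is not specified in the abstract.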