{"title":"基于svm的多模态特征融合人脸视频情感识别","authors":"Jindi Bao;Jianjun Qian;Jian Yang","doi":"10.1109/TAFFC.2025.3528636","DOIUrl":null,"url":null,"abstract":"Multimodal emotion recognition based on facial videos aims to extract features from different modalities to identify human emotions. The previous work focus on designing various fusion schemes to combine heterogeneous modal data. However, most studies have overlooked the role of different modalities in emotion recognition and have not fully utilized the intrinsic connections between modalities. Furthermore, the multimodal data from facial videos also contain various distractions bad for emotion analysis. How to reduce the impact of distractions and enable a model to mine effective information for emotion recognition from different modalities is still a challenge problem. To address above issue, we propose a SVD-guided multimodal feature fusion method based on facial video for emotion recognition, which uses a hierarchical fusion mechanism and adopts different loss strategies at each level to learn multimodal feature representation. Specifically, we fuse the facial expression and rPPG signal (or Point-of-Gaze) by using the weak supervision strategy and contrastive learning. Subsequently, the fused feature of facial expression and rPPG signal and the fused feature of facial expression and Point-of-Gaze are combined together to construct the unified multimodal feature matrix. Based on this, Singular Value Decomposition (SVD) is used to refine the redundancy information caused by the multimodal fusion and guide the neural network to learn discriminative emotion feature. At the same time, a consistent loss is developed to enhance the multimodal representation. Experiments on three public datasets show that the proposed method achieves better results over the compared methods.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"1705-1715"},"PeriodicalIF":9.8000,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SVD-Guided Multimodal Feature Fusion for Emotion Recognition From Facial Videos\",\"authors\":\"Jindi Bao;Jianjun Qian;Jian Yang\",\"doi\":\"10.1109/TAFFC.2025.3528636\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal emotion recognition based on facial videos aims to extract features from different modalities to identify human emotions. The previous work focus on designing various fusion schemes to combine heterogeneous modal data. However, most studies have overlooked the role of different modalities in emotion recognition and have not fully utilized the intrinsic connections between modalities. Furthermore, the multimodal data from facial videos also contain various distractions bad for emotion analysis. How to reduce the impact of distractions and enable a model to mine effective information for emotion recognition from different modalities is still a challenge problem. To address above issue, we propose a SVD-guided multimodal feature fusion method based on facial video for emotion recognition, which uses a hierarchical fusion mechanism and adopts different loss strategies at each level to learn multimodal feature representation. Specifically, we fuse the facial expression and rPPG signal (or Point-of-Gaze) by using the weak supervision strategy and contrastive learning. 
Subsequently, the fused feature of facial expression and rPPG signal and the fused feature of facial expression and Point-of-Gaze are combined together to construct the unified multimodal feature matrix. Based on this, Singular Value Decomposition (SVD) is used to refine the redundancy information caused by the multimodal fusion and guide the neural network to learn discriminative emotion feature. At the same time, a consistent loss is developed to enhance the multimodal representation. Experiments on three public datasets show that the proposed method achieves better results over the compared methods.\",\"PeriodicalId\":13131,\"journal\":{\"name\":\"IEEE Transactions on Affective Computing\",\"volume\":\"16 3\",\"pages\":\"1705-1715\"},\"PeriodicalIF\":9.8000,\"publicationDate\":\"2025-01-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Affective Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10850745/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10850745/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
SVD-Guided Multimodal Feature Fusion for Emotion Recognition From Facial Videos
Multimodal emotion recognition from facial videos aims to extract features from different modalities to identify human emotions. Previous work focuses on designing various fusion schemes to combine heterogeneous modal data. However, most studies have overlooked the role of different modalities in emotion recognition and have not fully exploited the intrinsic connections between modalities. Furthermore, the multimodal data extracted from facial videos also contain various distractions that are detrimental to emotion analysis. How to reduce the impact of these distractions and enable a model to mine effective information for emotion recognition from different modalities remains a challenging problem. To address these issues, we propose an SVD-guided multimodal feature fusion method for emotion recognition from facial videos, which uses a hierarchical fusion mechanism and adopts different loss strategies at each level to learn multimodal feature representations. Specifically, we fuse the facial expression with the rPPG signal (or Point-of-Gaze) using a weak supervision strategy and contrastive learning. Subsequently, the fused feature of facial expression and rPPG signal and the fused feature of facial expression and Point-of-Gaze are combined to construct a unified multimodal feature matrix. Based on this, Singular Value Decomposition (SVD) is used to refine the redundant information introduced by multimodal fusion and to guide the neural network to learn discriminative emotion features. At the same time, a consistency loss is developed to enhance the multimodal representation. Experiments on three public datasets show that the proposed method achieves better results than the compared methods.
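To make the final steps sketched in the abstract more concrete, the snippet below illustrates one possible reading of the SVD-guided refinement of a unified multimodal feature matrix together with a consistency term between the two fused branches. This is only a minimal PyTorch sketch: the tensor shapes, the rank choice, the cosine-based consistency term, and the names svd_refine and consistency_loss are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the paper's code): SVD-based refinement of a unified
# multimodal feature matrix plus a simple consistency loss between fused branches.
import torch
import torch.nn.functional as F

def svd_refine(fused: torch.Tensor, rank: int) -> torch.Tensor:
    """Keep only the top-`rank` singular components of each sample's fused
    feature matrix, discarding redundancy introduced by multimodal fusion."""
    # fused: (batch, n_branches, d) unified multimodal feature matrix per sample
    U, S, Vh = torch.linalg.svd(fused, full_matrices=False)
    U_r, S_r, Vh_r = U[..., :rank], S[..., :rank], Vh[..., :rank, :]
    return U_r @ torch.diag_embed(S_r) @ Vh_r  # low-rank reconstruction

def consistency_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """An assumed consistency term pulling the two fused branches
    (expression+rPPG and expression+gaze) toward a shared representation."""
    return 1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()

# Toy usage: two fused branch features stacked into a unified matrix.
batch, d = 8, 128
f_expr_rppg = torch.randn(batch, d)  # fused facial expression + rPPG feature
f_expr_gaze = torch.randn(batch, d)  # fused facial expression + gaze feature
unified = torch.stack([f_expr_rppg, f_expr_gaze], dim=1)  # (batch, 2, d)

refined = svd_refine(unified, rank=1)            # less redundant representation
loss_cons = consistency_loss(f_expr_rppg, f_expr_gaze)
emotion_feature = refined.mean(dim=1)            # pooled feature for the classifier head
```

In this reading, truncating the singular spectrum is what suppresses the redundancy introduced by stacking the two fused branches, while the consistency term keeps the branches aligned before pooling; how the paper actually combines these terms with its hierarchical losses is not specified in the abstract.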
Journal Introduction:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.