{"title":"基于svm的多模态特征融合人脸视频情感识别","authors":"Jindi Bao;Jianjun Qian;Jian Yang","doi":"10.1109/TAFFC.2025.3528636","DOIUrl":null,"url":null,"abstract":"Multimodal emotion recognition based on facial videos aims to extract features from different modalities to identify human emotions. The previous work focus on designing various fusion schemes to combine heterogeneous modal data. However, most studies have overlooked the role of different modalities in emotion recognition and have not fully utilized the intrinsic connections between modalities. Furthermore, the multimodal data from facial videos also contain various distractions bad for emotion analysis. How to reduce the impact of distractions and enable a model to mine effective information for emotion recognition from different modalities is still a challenge problem. To address above issue, we propose a SVD-guided multimodal feature fusion method based on facial video for emotion recognition, which uses a hierarchical fusion mechanism and adopts different loss strategies at each level to learn multimodal feature representation. Specifically, we fuse the facial expression and rPPG signal (or Point-of-Gaze) by using the weak supervision strategy and contrastive learning. Subsequently, the fused feature of facial expression and rPPG signal and the fused feature of facial expression and Point-of-Gaze are combined together to construct the unified multimodal feature matrix. Based on this, Singular Value Decomposition (SVD) is used to refine the redundancy information caused by the multimodal fusion and guide the neural network to learn discriminative emotion feature. At the same time, a consistent loss is developed to enhance the multimodal representation. Experiments on three public datasets show that the proposed method achieves better results over the compared methods.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"1705-1715"},"PeriodicalIF":9.8000,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SVD-Guided Multimodal Feature Fusion for Emotion Recognition From Facial Videos\",\"authors\":\"Jindi Bao;Jianjun Qian;Jian Yang\",\"doi\":\"10.1109/TAFFC.2025.3528636\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal emotion recognition based on facial videos aims to extract features from different modalities to identify human emotions. The previous work focus on designing various fusion schemes to combine heterogeneous modal data. However, most studies have overlooked the role of different modalities in emotion recognition and have not fully utilized the intrinsic connections between modalities. Furthermore, the multimodal data from facial videos also contain various distractions bad for emotion analysis. How to reduce the impact of distractions and enable a model to mine effective information for emotion recognition from different modalities is still a challenge problem. To address above issue, we propose a SVD-guided multimodal feature fusion method based on facial video for emotion recognition, which uses a hierarchical fusion mechanism and adopts different loss strategies at each level to learn multimodal feature representation. Specifically, we fuse the facial expression and rPPG signal (or Point-of-Gaze) by using the weak supervision strategy and contrastive learning. 
Subsequently, the fused feature of facial expression and rPPG signal and the fused feature of facial expression and Point-of-Gaze are combined together to construct the unified multimodal feature matrix. Based on this, Singular Value Decomposition (SVD) is used to refine the redundancy information caused by the multimodal fusion and guide the neural network to learn discriminative emotion feature. At the same time, a consistent loss is developed to enhance the multimodal representation. Experiments on three public datasets show that the proposed method achieves better results over the compared methods.\",\"PeriodicalId\":13131,\"journal\":{\"name\":\"IEEE Transactions on Affective Computing\",\"volume\":\"16 3\",\"pages\":\"1705-1715\"},\"PeriodicalIF\":9.8000,\"publicationDate\":\"2025-01-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Affective Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10850745/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10850745/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
SVD-Guided Multimodal Feature Fusion for Emotion Recognition From Facial Videos
Multimodal emotion recognition from facial videos aims to extract features from different modalities to identify human emotions. Previous work focuses on designing various fusion schemes to combine heterogeneous modal data. However, most studies have overlooked the role of different modalities in emotion recognition and have not fully exploited the intrinsic connections between modalities. Furthermore, the multimodal data extracted from facial videos also contain various distractions that are detrimental to emotion analysis. How to reduce the impact of these distractions and enable a model to mine effective information for emotion recognition from different modalities remains a challenging problem. To address these issues, we propose an SVD-guided multimodal feature fusion method for emotion recognition from facial videos, which uses a hierarchical fusion mechanism and adopts different loss strategies at each level to learn multimodal feature representations. Specifically, we fuse the facial expression with the rPPG signal (or Point-of-Gaze) using a weak supervision strategy and contrastive learning. Subsequently, the fused feature of facial expression and rPPG signal and the fused feature of facial expression and Point-of-Gaze are combined to construct a unified multimodal feature matrix. Based on this, Singular Value Decomposition (SVD) is used to refine the redundant information introduced by multimodal fusion and to guide the neural network to learn discriminative emotion features. At the same time, a consistency loss is developed to enhance the multimodal representation. Experiments on three public datasets show that the proposed method achieves better results than the compared methods.
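To make the final steps sketched in the abstract more concrete, the snippet below illustrates one possible reading of the SVD-guided refinement of a unified multimodal feature matrix together with a consistency term between the two fused branches. This is only a minimal PyTorch sketch: the tensor shapes, the rank choice, the cosine-based consistency term, and the names svd_refine and consistency_loss are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the paper's code): SVD-based refinement of a unified
# multimodal feature matrix plus a simple consistency loss between fused branches.
import torch
import torch.nn.functional as F

def svd_refine(fused: torch.Tensor, rank: int) -> torch.Tensor:
    """Keep only the top-`rank` singular components of each sample's fused
    feature matrix, discarding redundancy introduced by multimodal fusion."""
    # fused: (batch, n_branches, d) unified multimodal feature matrix per sample
    U, S, Vh = torch.linalg.svd(fused, full_matrices=False)
    U_r, S_r, Vh_r = U[..., :rank], S[..., :rank], Vh[..., :rank, :]
    return U_r @ torch.diag_embed(S_r) @ Vh_r  # low-rank reconstruction

def consistency_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """An assumed consistency term pulling the two fused branches
    (expression+rPPG and expression+gaze) toward a shared representation."""
    return 1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()

# Toy usage: two fused branch features stacked into a unified matrix.
batch, d = 8, 128
f_expr_rppg = torch.randn(batch, d)  # fused facial expression + rPPG feature
f_expr_gaze = torch.randn(batch, d)  # fused facial expression + gaze feature
unified = torch.stack([f_expr_rppg, f_expr_gaze], dim=1)  # (batch, 2, d)

refined = svd_refine(unified, rank=1)            # less redundant representation
loss_cons = consistency_loss(f_expr_rppg, f_expr_gaze)
emotion_feature = refined.mean(dim=1)            # pooled feature for the classifier head
```

In this reading, truncating the singular spectrum is what suppresses the redundancy introduced by stacking the two fused branches, while the consistency term keeps the branches aligned before pooling; how the paper actually combines these terms with its hierarchical losses is not specified in the abstract.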
Journal Introduction:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.