SG-TE: Spatial Guidance and Temporal Enhancement Network for Facial-Bodily Emotion Recognition

Zhong Huang, Danni Zhang, Fuji Ren, Min Hu, Juan Liu, Haitao Yu

CAAI Transactions on Intelligence Technology, vol. 10, no. 3, pp. 871-890. Published 2025-03-26. DOI: 10.1049/cit2.70006. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.70006
Citations: 0
Abstract
To overcome the deficiencies of single-modal emotion recognition based on facial expression or bodily posture in natural scenes, a spatial guidance and temporal enhancement (SG-TE) network is proposed for facial-bodily emotion recognition. First, ResNet50, DNN and spatial transformer models are used to capture facial texture vectors, bodily skeleton vectors and whole-body geometric vectors, and an intraframe correlation attention guidance (S-CAG) mechanism, which guides the facial texture vector and the bodily skeleton vector with the whole-body geometric vector, is designed to exploit the potential spatial emotional correlation between face and posture. Second, an interframe significant segment enhancement (T-SSE) structure is embedded into a temporal transformer to enhance information from frames with high emotional intensity and to avoid emotional asynchrony. Finally, an adaptive weight assignment (M-AWA) strategy is constructed to realise facial-bodily fusion. Experimental results on the BabyRobot Emotion Dataset (BRED) and the Context-Aware Emotion Recognition (CAER) dataset show that the proposed network reaches accuracies of 81.61% and 89.39%, which are 9.61% and 9.46% higher than those of the baseline network, respectively. Compared with state-of-the-art methods, the proposed method achieves 7.73% and 20.57% higher accuracy than single-modal methods based on facial expression or bodily posture, respectively, and 2.16% higher accuracy than dual-modal methods based on facial-bodily fusion. Therefore, the proposed method, which adaptively fuses the complementary information of face and posture, improves the quality of emotion recognition in real-world scenarios.
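To make the two fusion ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a geometric-vector-guided attention gate standing in for the S-CAG mechanism, and a learned per-sample weighting of the facial and bodily branches standing in for the M-AWA strategy. All module names, dimensions and design details here are illustrative assumptions; the paper's actual SG-TE architecture may differ.

```python
# Illustrative sketch only; names, dimensions and details are assumptions,
# not the paper's actual SG-TE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricGuidedAttention(nn.Module):
    """Hypothetical S-CAG-style guidance: the whole-body geometric vector
    acts as a query that re-weights a modality-specific feature vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # projects the geometric (guiding) vector
        self.k = nn.Linear(dim, dim)  # projects the guided modality vector
        self.v = nn.Linear(dim, dim)

    def forward(self, geo: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # geo, feat: (batch, dim); single-token attention for simplicity
        attn = torch.sigmoid((self.q(geo) * self.k(feat)).sum(-1, keepdim=True)
                             / feat.size(-1) ** 0.5)
        return feat + attn * self.v(feat)  # residual, attention-gated update

class AdaptiveWeightFusion(nn.Module):
    """Hypothetical M-AWA-style fusion: learn per-sample weights for the
    facial and bodily branches, then classify the fused representation."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)          # scores for the two modalities
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, face: torch.Tensor, body: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.gate(torch.cat([face, body], dim=-1)), dim=-1)
        fused = w[:, :1] * face + w[:, 1:] * body  # adaptive weighted sum
        return self.classifier(fused)

# Toy usage: random features stand in for the ResNet50 / DNN /
# spatial-transformer outputs described in the abstract.
dim, num_classes, batch = 128, 7, 4
guide = GeometricGuidedAttention(dim)
fuse = AdaptiveWeightFusion(dim, num_classes)
geo = torch.randn(batch, dim)                 # whole-body geometric vector
face = guide(geo, torch.randn(batch, dim))    # guided facial texture vector
body = guide(geo, torch.randn(batch, dim))    # guided bodily skeleton vector
logits = fuse(face, body)                     # (4, 7) emotion logits
```

The design choice this illustrates is that fusion weights are computed per sample rather than fixed, so frames where one modality is uninformative (e.g. an occluded face) can lean on the other; the temporal T-SSE component is omitted here because the abstract gives no frame-level detail to ground it.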
Journal Introduction:
CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI), providing research that is openly accessible to read and share worldwide.