SG-TE: Spatial Guidance and Temporal Enhancement Network for Facial-Bodily Emotion Recognition

Zhong Huang, Danni Zhang, Fuji Ren, Min Hu, Juan Liu, Haitao Yu

CAAI Transactions on Intelligence Technology, vol. 10, no. 3, pp. 871-890. Published 2025-03-26. DOI: 10.1049/cit2.70006. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.70006
Citations: 0
Abstract
To overcome the deficiencies of single-modal emotion recognition based on facial expression or bodily posture in natural scenes, a spatial guidance and temporal enhancement (SG-TE) network is proposed for facial-bodily emotion recognition. First, ResNet50, DNN and spatial transformer models are used to capture facial texture vectors, bodily skeleton vectors and whole-body geometric vectors, and an intraframe correlation attention guidance (S-CAG) mechanism, which guides the facial texture vector and the bodily skeleton vector with the whole-body geometric vector, is designed to exploit the potential spatial emotional correlation between face and posture. Second, an interframe significant segment enhancement (T-SSE) structure is embedded into a temporal transformer to enhance information from frames with high emotional intensity and to avoid emotional asynchrony. Finally, an adaptive weight assignment (M-AWA) strategy is constructed to realise facial-bodily fusion. Experimental results on the BabyRobot Emotion Dataset (BRED) and the Context-Aware Emotion Recognition (CAER) dataset show that the proposed network reaches accuracies of 81.61% and 89.39%, which are 9.61% and 9.46% higher than those of the baseline network, respectively. Compared with state-of-the-art methods, the proposed method achieves 7.73% and 20.57% higher accuracy than single-modal methods based on facial expression or bodily posture, respectively, and 2.16% higher accuracy than dual-modal methods based on facial-bodily fusion. Therefore, the proposed method, which adaptively fuses the complementary information of face and posture, improves the quality of emotion recognition in real-world scenarios.
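To make the two fusion ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a geometric-vector-guided attention gate standing in for the S-CAG mechanism, and a learned per-sample weighting of the facial and bodily branches standing in for the M-AWA strategy. All module names, dimensions and design details here are illustrative assumptions; the paper's actual SG-TE architecture may differ.

```python
# Illustrative sketch only; names, dimensions and details are assumptions,
# not the paper's actual SG-TE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricGuidedAttention(nn.Module):
    """Hypothetical S-CAG-style guidance: the whole-body geometric vector
    acts as a query that re-weights a modality-specific feature vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # projects the geometric (guiding) vector
        self.k = nn.Linear(dim, dim)  # projects the guided modality vector
        self.v = nn.Linear(dim, dim)

    def forward(self, geo: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # geo, feat: (batch, dim); single-token attention for simplicity
        attn = torch.sigmoid((self.q(geo) * self.k(feat)).sum(-1, keepdim=True)
                             / feat.size(-1) ** 0.5)
        return feat + attn * self.v(feat)  # residual, attention-gated update

class AdaptiveWeightFusion(nn.Module):
    """Hypothetical M-AWA-style fusion: learn per-sample weights for the
    facial and bodily branches, then classify the fused representation."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)          # scores for the two modalities
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, face: torch.Tensor, body: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.gate(torch.cat([face, body], dim=-1)), dim=-1)
        fused = w[:, :1] * face + w[:, 1:] * body  # adaptive weighted sum
        return self.classifier(fused)

# Toy usage: random features stand in for the ResNet50 / DNN /
# spatial-transformer outputs described in the abstract.
dim, num_classes, batch = 128, 7, 4
guide = GeometricGuidedAttention(dim)
fuse = AdaptiveWeightFusion(dim, num_classes)
geo = torch.randn(batch, dim)                 # whole-body geometric vector
face = guide(geo, torch.randn(batch, dim))    # guided facial texture vector
body = guide(geo, torch.randn(batch, dim))    # guided bodily skeleton vector
logits = fuse(face, body)                     # (4, 7) emotion logits
```

The design choice this illustrates is that fusion weights are computed per sample rather than fixed, so frames where one modality is uninformative (e.g. an occluded face) can lean on the other; the temporal T-SSE component is omitted here because the abstract gives no frame-level detail to ground it.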
Journal Introduction:
CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI), providing research that is openly accessible to read and share worldwide.