Context-aware emotion recognition through agent-scene interactions

IF 8.0 · CAS Zone 2 (Computer Science) · JCR Q1 (Automation & Control Systems)
Yu-Xiang Chen, Hong-Mei Sun, Cheng-Yue Che, Shuo Feng, Rui-Sheng Jia
{"title":"情境感知情感识别通过代理-场景交互","authors":"Yu-Xiang Chen,&nbsp;Hong-Mei Sun,&nbsp;Cheng-Yue Che,&nbsp;Shuo Feng,&nbsp;Rui-Sheng Jia","doi":"10.1016/j.engappai.2025.111581","DOIUrl":null,"url":null,"abstract":"<div><div>In real-world scenarios, context-aware emotion recognition (CAER) is a key problem in affective computing with broad application prospects. Most current CAER methods primarily rely on image-level contextual features. However, the interactive relationships between the agent and other objects within the scene are often overlooked or only partially modeled, which limits emotion recognition accuracy. To address this, we proposed a spatial interactive context-aware emotion network (ICENet) that consists of an agent feature extraction branch and a scene-context interaction branch. Specifically, the agent feature extraction branch aims to extract facial and posture features from the target agent and fuse them. In the facial feature extraction network named FaceNet, pure Convolutional Neural Network (ConvNeXt) is used as the backbone to extract global features, and a self-attention-based fine-grained feature extraction (FGFE) module is designed to capture more discriminative local features. In the posture feature extraction network, semantic segmentation is used to extract human silhouettes, which are then processed by Vision Transformer to obtain posture-related features. Meanwhile, the scene-context interaction branch named ObjNet integrates agent’s gaze angle and global depth maps to construct target agent-objects relationship in three-dimensional (TAR3D). Subsequently, a Graph Convolutional Network is employed to model the TAR3D and extract scene-context interaction features. Subsequently, a multiplicative fusion strategy is adopted to integrate agent features with scene-context interaction features, and emotion classification is performed based on the fused representation. Finally, experiments on EMOTIC and CAER-S datasets show that our approach outperforms current state-of-the-art methods in classification accuracy. The code is available at <span><span>https://github.com/Cyx336/ICENet.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50523,"journal":{"name":"Engineering Applications of Artificial Intelligence","volume":"158 ","pages":"Article 111581"},"PeriodicalIF":8.0000,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Context-aware emotion recognition through agent-scene interactions\",\"authors\":\"Yu-Xiang Chen,&nbsp;Hong-Mei Sun,&nbsp;Cheng-Yue Che,&nbsp;Shuo Feng,&nbsp;Rui-Sheng Jia\",\"doi\":\"10.1016/j.engappai.2025.111581\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In real-world scenarios, context-aware emotion recognition (CAER) is a key problem in affective computing with broad application prospects. Most current CAER methods primarily rely on image-level contextual features. However, the interactive relationships between the agent and other objects within the scene are often overlooked or only partially modeled, which limits emotion recognition accuracy. To address this, we proposed a spatial interactive context-aware emotion network (ICENet) that consists of an agent feature extraction branch and a scene-context interaction branch. Specifically, the agent feature extraction branch aims to extract facial and posture features from the target agent and fuse them. 
In the facial feature extraction network named FaceNet, pure Convolutional Neural Network (ConvNeXt) is used as the backbone to extract global features, and a self-attention-based fine-grained feature extraction (FGFE) module is designed to capture more discriminative local features. In the posture feature extraction network, semantic segmentation is used to extract human silhouettes, which are then processed by Vision Transformer to obtain posture-related features. Meanwhile, the scene-context interaction branch named ObjNet integrates agent’s gaze angle and global depth maps to construct target agent-objects relationship in three-dimensional (TAR3D). Subsequently, a Graph Convolutional Network is employed to model the TAR3D and extract scene-context interaction features. Subsequently, a multiplicative fusion strategy is adopted to integrate agent features with scene-context interaction features, and emotion classification is performed based on the fused representation. Finally, experiments on EMOTIC and CAER-S datasets show that our approach outperforms current state-of-the-art methods in classification accuracy. The code is available at <span><span>https://github.com/Cyx336/ICENet.git</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50523,\"journal\":{\"name\":\"Engineering Applications of Artificial Intelligence\",\"volume\":\"158 \",\"pages\":\"Article 111581\"},\"PeriodicalIF\":8.0000,\"publicationDate\":\"2025-06-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Engineering Applications of Artificial Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0952197625015830\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Applications of Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0952197625015830","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

In real-world scenarios, context-aware emotion recognition (CAER) is a key problem in affective computing with broad application prospects. Most current CAER methods rely primarily on image-level contextual features; however, the interactive relationships between the agent and other objects within the scene are often overlooked or only partially modeled, which limits emotion recognition accuracy. To address this, we propose a spatial interactive context-aware emotion network (ICENet) that consists of an agent feature extraction branch and a scene-context interaction branch. Specifically, the agent feature extraction branch extracts facial and posture features from the target agent and fuses them. In the facial feature extraction network, named FaceNet, a pure convolutional network (ConvNeXt) is used as the backbone to extract global features, and a self-attention-based fine-grained feature extraction (FGFE) module is designed to capture more discriminative local features. In the posture feature extraction network, semantic segmentation is used to extract human silhouettes, which are then processed by a Vision Transformer to obtain posture-related features. Meanwhile, the scene-context interaction branch, named ObjNet, integrates the agent's gaze angle and global depth maps to construct the target agent-object relationship in three dimensions (TAR3D). A Graph Convolutional Network is then employed to model the TAR3D graph and extract scene-context interaction features, after which a multiplicative fusion strategy integrates the agent features with the scene-context interaction features, and emotion classification is performed on the fused representation. Finally, experiments on the EMOTIC and CAER-S datasets show that our approach outperforms current state-of-the-art methods in classification accuracy. The code is available at https://github.com/Cyx336/ICENet.git.
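The abstract gives no implementation details beyond the repository link, but the FGFE module's stated design (self-attention over backbone features to pick out discriminative local cues) admits a compact sketch. The sketch below is illustrative only: the module name comes from the abstract, while the feature dimension, head count, and mean pooling are assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of a self-attention fine-grained feature extractor (FGFE).
# Only the idea (attention over ConvNeXt feature-map patches) comes from the
# abstract; dim, num_heads, and mean pooling are assumptions for illustration.
import torch
import torch.nn as nn

class FGFE(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (B, C, H, W) global feature map from the ConvNeXt backbone
        tokens = fmap.flatten(2).transpose(1, 2)   # (B, H*W, C) patch tokens
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)      # residual connection + LayerNorm
        return tokens.mean(dim=1)                  # (B, C) pooled local descriptor
```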
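Likewise, the TAR3D-plus-GCN step can be pictured as: back-project the agent and detected objects into 3D using the depth map, weight agent-object edges by their spatial relationship, and propagate node features with a symmetric-normalized graph convolution, H' = ReLU(D^-1/2 A D^-1/2 H W). The Gaussian-distance edge weighting below is a stand-in assumption; the paper's actual gaze- and depth-based edge rule may differ.

```python
# Illustrative sketch: a TAR3D-style weighted graph and one GCN layer.
# The Gaussian edge weighting is an assumption, not the paper's exact rule.
import torch
import torch.nn as nn

def build_tar3d_adjacency(positions: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # positions: (N, 3) agent/object coordinates back-projected via the depth map
    dist = torch.cdist(positions, positions)          # (N, N) pairwise 3D distances
    return torch.exp(-dist.pow(2) / (2 * sigma**2))   # diagonal exp(0)=1 acts as self-loops

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features; adj: (N, N) weighted adjacency
        d_inv_sqrt = adj.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm_adj @ self.lin(x))     # symmetric-normalized propagation
```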
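Finally, "multiplicative fusion" most plainly reads as an element-wise (Hadamard) product of the two branch representations before classification. A minimal sketch under that reading follows; the projection layers, tanh squashing, and 26-way output (the EMOTIC discrete-category count) are assumptions rather than confirmed details of ICENet.

```python
# Minimal sketch of multiplicative (Hadamard) fusion of the two branches.
# Projections, tanh, and the 26-way head are assumptions for illustration.
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    def __init__(self, dim: int = 512, num_classes: int = 26):
        super().__init__()
        self.proj_agent = nn.Linear(dim, dim)
        self.proj_ctx = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, agent_feat: torch.Tensor, ctx_feat: torch.Tensor) -> torch.Tensor:
        # Element-wise product gates each agent dimension by its context counterpart
        fused = torch.tanh(self.proj_agent(agent_feat)) * torch.tanh(self.proj_ctx(ctx_feat))
        return self.classifier(fused)   # (B, num_classes) emotion logits
```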
Source journal
Engineering Applications of Artificial Intelligence (Engineering: Electrical & Electronic)
CiteScore: 9.60
Self-citation rate: 10.00%
Articles per year: 505
Review time: 68 days
Journal introduction: Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.