Context-aware emotion recognition through agent-scene interactions

IF 8.0 · CAS Zone 2 (Computer Science) · JCR Q1 (Automation & Control Systems)
Yu-Xiang Chen, Hong-Mei Sun, Cheng-Yue Che, Shuo Feng, Rui-Sheng Jia
{"title":"情境感知情感识别通过代理-场景交互","authors":"Yu-Xiang Chen,&nbsp;Hong-Mei Sun,&nbsp;Cheng-Yue Che,&nbsp;Shuo Feng,&nbsp;Rui-Sheng Jia","doi":"10.1016/j.engappai.2025.111581","DOIUrl":null,"url":null,"abstract":"<div><div>In real-world scenarios, context-aware emotion recognition (CAER) is a key problem in affective computing with broad application prospects. Most current CAER methods primarily rely on image-level contextual features. However, the interactive relationships between the agent and other objects within the scene are often overlooked or only partially modeled, which limits emotion recognition accuracy. To address this, we proposed a spatial interactive context-aware emotion network (ICENet) that consists of an agent feature extraction branch and a scene-context interaction branch. Specifically, the agent feature extraction branch aims to extract facial and posture features from the target agent and fuse them. In the facial feature extraction network named FaceNet, pure Convolutional Neural Network (ConvNeXt) is used as the backbone to extract global features, and a self-attention-based fine-grained feature extraction (FGFE) module is designed to capture more discriminative local features. In the posture feature extraction network, semantic segmentation is used to extract human silhouettes, which are then processed by Vision Transformer to obtain posture-related features. Meanwhile, the scene-context interaction branch named ObjNet integrates agent’s gaze angle and global depth maps to construct target agent-objects relationship in three-dimensional (TAR3D). Subsequently, a Graph Convolutional Network is employed to model the TAR3D and extract scene-context interaction features. Subsequently, a multiplicative fusion strategy is adopted to integrate agent features with scene-context interaction features, and emotion classification is performed based on the fused representation. Finally, experiments on EMOTIC and CAER-S datasets show that our approach outperforms current state-of-the-art methods in classification accuracy. The code is available at <span><span>https://github.com/Cyx336/ICENet.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50523,"journal":{"name":"Engineering Applications of Artificial Intelligence","volume":"158 ","pages":"Article 111581"},"PeriodicalIF":8.0000,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Context-aware emotion recognition through agent-scene interactions\",\"authors\":\"Yu-Xiang Chen,&nbsp;Hong-Mei Sun,&nbsp;Cheng-Yue Che,&nbsp;Shuo Feng,&nbsp;Rui-Sheng Jia\",\"doi\":\"10.1016/j.engappai.2025.111581\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In real-world scenarios, context-aware emotion recognition (CAER) is a key problem in affective computing with broad application prospects. Most current CAER methods primarily rely on image-level contextual features. However, the interactive relationships between the agent and other objects within the scene are often overlooked or only partially modeled, which limits emotion recognition accuracy. To address this, we proposed a spatial interactive context-aware emotion network (ICENet) that consists of an agent feature extraction branch and a scene-context interaction branch. Specifically, the agent feature extraction branch aims to extract facial and posture features from the target agent and fuse them. 
In the facial feature extraction network named FaceNet, pure Convolutional Neural Network (ConvNeXt) is used as the backbone to extract global features, and a self-attention-based fine-grained feature extraction (FGFE) module is designed to capture more discriminative local features. In the posture feature extraction network, semantic segmentation is used to extract human silhouettes, which are then processed by Vision Transformer to obtain posture-related features. Meanwhile, the scene-context interaction branch named ObjNet integrates agent’s gaze angle and global depth maps to construct target agent-objects relationship in three-dimensional (TAR3D). Subsequently, a Graph Convolutional Network is employed to model the TAR3D and extract scene-context interaction features. Subsequently, a multiplicative fusion strategy is adopted to integrate agent features with scene-context interaction features, and emotion classification is performed based on the fused representation. Finally, experiments on EMOTIC and CAER-S datasets show that our approach outperforms current state-of-the-art methods in classification accuracy. The code is available at <span><span>https://github.com/Cyx336/ICENet.git</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50523,\"journal\":{\"name\":\"Engineering Applications of Artificial Intelligence\",\"volume\":\"158 \",\"pages\":\"Article 111581\"},\"PeriodicalIF\":8.0000,\"publicationDate\":\"2025-06-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Engineering Applications of Artificial Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0952197625015830\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Applications of Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0952197625015830","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

In real-world scenarios, context-aware emotion recognition (CAER) is a key problem in affective computing with broad application prospects. Most current CAER methods rely primarily on image-level contextual features; however, the interactive relationships between the agent and other objects within the scene are often overlooked or only partially modeled, which limits emotion recognition accuracy. To address this, we propose a spatial interactive context-aware emotion network (ICENet) that consists of an agent feature extraction branch and a scene-context interaction branch. Specifically, the agent feature extraction branch extracts facial and posture features from the target agent and fuses them. In the facial feature extraction network, named FaceNet, a pure convolutional network (ConvNeXt) is used as the backbone to extract global features, and a self-attention-based fine-grained feature extraction (FGFE) module is designed to capture more discriminative local features. In the posture feature extraction network, semantic segmentation is used to extract human silhouettes, which are then processed by a Vision Transformer to obtain posture-related features. Meanwhile, the scene-context interaction branch, named ObjNet, integrates the agent's gaze angle and global depth maps to construct the target agent-object relationship in three dimensions (TAR3D). A Graph Convolutional Network is then employed to model the TAR3D graph and extract scene-context interaction features, after which a multiplicative fusion strategy integrates the agent features with the scene-context interaction features, and emotion classification is performed on the fused representation. Finally, experiments on the EMOTIC and CAER-S datasets show that our approach outperforms current state-of-the-art methods in classification accuracy. The code is available at https://github.com/Cyx336/ICENet.git.
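The abstract gives no implementation details beyond the repository link, but the FGFE module's stated design (self-attention over backbone features to pick out discriminative local cues) admits a compact sketch. The sketch below is illustrative only: the module name comes from the abstract, while the feature dimension, head count, and mean pooling are assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of a self-attention fine-grained feature extractor (FGFE).
# Only the idea (attention over ConvNeXt feature-map patches) comes from the
# abstract; dim, num_heads, and mean pooling are assumptions for illustration.
import torch
import torch.nn as nn

class FGFE(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (B, C, H, W) global feature map from the ConvNeXt backbone
        tokens = fmap.flatten(2).transpose(1, 2)   # (B, H*W, C) patch tokens
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)      # residual connection + LayerNorm
        return tokens.mean(dim=1)                  # (B, C) pooled local descriptor
```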
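Likewise, the TAR3D-plus-GCN step can be pictured as: back-project the agent and detected objects into 3D using the depth map, weight agent-object edges by their spatial relationship, and propagate node features with a symmetric-normalized graph convolution, H' = ReLU(D^-1/2 A D^-1/2 H W). The Gaussian-distance edge weighting below is a stand-in assumption; the paper's actual gaze- and depth-based edge rule may differ.

```python
# Illustrative sketch: a TAR3D-style weighted graph and one GCN layer.
# The Gaussian edge weighting is an assumption, not the paper's exact rule.
import torch
import torch.nn as nn

def build_tar3d_adjacency(positions: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # positions: (N, 3) agent/object coordinates back-projected via the depth map
    dist = torch.cdist(positions, positions)          # (N, N) pairwise 3D distances
    return torch.exp(-dist.pow(2) / (2 * sigma**2))   # diagonal exp(0)=1 acts as self-loops

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features; adj: (N, N) weighted adjacency
        d_inv_sqrt = adj.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm_adj @ self.lin(x))     # symmetric-normalized propagation
```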
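Finally, "multiplicative fusion" most plainly reads as an element-wise (Hadamard) product of the two branch representations before classification. A minimal sketch under that reading follows; the projection layers, tanh squashing, and 26-way output (the EMOTIC discrete-category count) are assumptions rather than confirmed details of ICENet.

```python
# Minimal sketch of multiplicative (Hadamard) fusion of the two branches.
# Projections, tanh, and the 26-way head are assumptions for illustration.
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    def __init__(self, dim: int = 512, num_classes: int = 26):
        super().__init__()
        self.proj_agent = nn.Linear(dim, dim)
        self.proj_ctx = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, agent_feat: torch.Tensor, ctx_feat: torch.Tensor) -> torch.Tensor:
        # Element-wise product gates each agent dimension by its context counterpart
        fused = torch.tanh(self.proj_agent(agent_feat)) * torch.tanh(self.proj_ctx(ctx_feat))
        return self.classifier(fused)   # (B, num_classes) emotion logits
```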
Source journal
Engineering Applications of Artificial Intelligence (Engineering: Electrical & Electronic)
CiteScore: 9.60
Self-citation rate: 10.00%
Articles per year: 505
Review time: 68 days
Journal introduction: Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.