{"title":"三维骨骼运动增强的情感识别网络","authors":"Jiaqi Shi, Chaoran Liu, C. Ishi, H. Ishiguro","doi":"10.1017/ATSIP.2021.11","DOIUrl":null,"url":null,"abstract":"Automatic emotion recognition has become an important trend in the fields of human–computer natural interaction and artificial intelligence. Although gesture is one of the most important components of nonverbal communication, which has a considerable impact on emotion recognition, it is rarely considered in the study of emotion recognition. An important reason is the lack of large open-source emotional databases containing skeletal movement data. In this paper, we extract three-dimensional skeleton information from videos and apply the method to IEMOCAP database to add a new modality. We propose an attention-based convolutional neural network which takes the extracted data as input to predict the speakers’ emotional state. We also propose a graph attention-based fusion method that combines our model with the models using other modalities, to provide complementary information in the emotion classification task and effectively fuse multimodal cues. The combined model utilizes audio signals, text information, and skeletal data. The performance of the model significantly outperforms the bimodal model and other fusion strategies, proving the effectiveness of the method.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2021-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"3D skeletal movement-enhanced emotion recognition networks\",\"authors\":\"Jiaqi Shi, Chaoran Liu, C. Ishi, H. Ishiguro\",\"doi\":\"10.1017/ATSIP.2021.11\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic emotion recognition has become an important trend in the fields of human–computer natural interaction and artificial intelligence. Although gesture is one of the most important components of nonverbal communication, which has a considerable impact on emotion recognition, it is rarely considered in the study of emotion recognition. An important reason is the lack of large open-source emotional databases containing skeletal movement data. In this paper, we extract three-dimensional skeleton information from videos and apply the method to IEMOCAP database to add a new modality. We propose an attention-based convolutional neural network which takes the extracted data as input to predict the speakers’ emotional state. We also propose a graph attention-based fusion method that combines our model with the models using other modalities, to provide complementary information in the emotion classification task and effectively fuse multimodal cues. The combined model utilizes audio signals, text information, and skeletal data. 
The performance of the model significantly outperforms the bimodal model and other fusion strategies, proving the effectiveness of the method.\",\"PeriodicalId\":44812,\"journal\":{\"name\":\"APSIPA Transactions on Signal and Information Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2021-08-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"APSIPA Transactions on Signal and Information Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1017/ATSIP.2021.11\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"APSIPA Transactions on Signal and Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1017/ATSIP.2021.11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
3D skeletal movement-enhanced emotion recognition networks
Automatic emotion recognition has become an important trend in the fields of human–computer natural interaction and artificial intelligence. Although gesture is one of the most important components of nonverbal communication and has a considerable impact on emotion recognition, it is rarely considered in emotion recognition research. An important reason is the lack of large open-source emotion databases containing skeletal movement data. In this paper, we extract three-dimensional skeleton information from videos and apply the method to the IEMOCAP database to add a new modality. We propose an attention-based convolutional neural network that takes the extracted skeletal data as input to predict speakers' emotional states. We also propose a graph attention-based fusion method that combines our model with models trained on other modalities, providing complementary information for the emotion classification task and effectively fusing multimodal cues. The combined model utilizes audio signals, text information, and skeletal data. It significantly outperforms the bimodal model and other fusion strategies, demonstrating the effectiveness of the method.
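The abstract names two components: an attention-based CNN that classifies emotion from 3D skeletal sequences, and a graph attention-based fusion that combines the audio, text, and skeleton modalities. The paper's code is not reproduced here; the following is a minimal, hypothetical PyTorch sketch of those two ideas, not the authors' architecture. All module names, layer sizes, the 15-joint skeleton, and the four-class label set are illustrative assumptions.

```python
# Hypothetical sketch of the two components the abstract describes.
# All dimensions and names are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class SkeletonAttentionCNN(nn.Module):
    """1D CNN over time with temporal attention pooling (assumed structure)."""

    def __init__(self, num_joints=15, embed_dim=128, num_classes=4):
        super().__init__()
        in_channels = num_joints * 3  # (x, y, z) per joint per frame
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.attn = nn.Linear(embed_dim, 1)  # scores each time step
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, num_joints * 3)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (B, T, D)
        w = torch.softmax(self.attn(h), dim=1)            # attention over time
        z = (w * h).sum(dim=1)                            # attended embedding
        return self.classifier(z), z


class GraphAttentionFusion(nn.Module):
    """Attention over a fully connected modality graph whose nodes are the
    audio, text, and skeleton embeddings (assumed fusion scheme)."""

    def __init__(self, embed_dim=128, num_classes=4):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, modality_embs):
        # modality_embs: (batch, num_modalities, embed_dim)
        q, k, v = self.query(modality_embs), self.key(modality_embs), self.value(modality_embs)
        scores = q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5)
        attn = torch.softmax(scores, dim=-1)  # each modality attends to all
        fused = (attn @ v).mean(dim=1)        # pool node representations
        return self.classifier(fused)


if __name__ == "__main__":
    # Toy shapes: 2 utterances, 100 frames, 15 joints; 3 modalities, 128-dim.
    skel = torch.randn(2, 100, 15 * 3)
    logits, skel_emb = SkeletonAttentionCNN()(skel)
    embs = torch.stack([torch.randn(2, 128), torch.randn(2, 128), skel_emb], dim=1)
    print(GraphAttentionFusion()(embs).shape)  # torch.Size([2, 4])
```

In this sketch each modality node attends to every other node, so the skeletal embedding can borrow complementary cues from the audio and text embeddings before classification, which matches the intuition behind the fusion strategy described in the abstract.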