探索用于连续情感识别的多模态视觉特征

Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge Pub Date : 2016-10-16 DOI:10.1145/2988257.2988270

Bo Sun, Siming Cao, Liandong Li, Jun He, Lejun Yu

{"title":"探索用于连续情感识别的多模态视觉特征","authors":"Bo Sun, Siming Cao, Liandong Li, Jun He, Lejun Yu","doi":"10.1145/2988257.2988270","DOIUrl":null,"url":null,"abstract":"This paper presents our work in the Emotion Sub-Challenge of the 6th Audio/Visual Emotion Challenge and Workshop (AVEC 2016), whose goal is to explore utilizing audio, visual and physiological signals to continuously predict the value of the emotion dimensions (arousal and valence). As visual features are very important in emotion recognition, we try a variety of handcrafted and deep visual features. For each video clip, besides the baseline features, we extract multi-scale Dense SIFT features (MSDF), and some types of Convolutional neural networks (CNNs) features to recognize the expression phases of the current frame. We train linear Support Vector Regression (SVR) for every kind of features on the RECOLA dataset. Multimodal fusion of these modalities is then performed with a multiple linear regression model. The final Concordance Correlation Coefficient (CCC) we gained on the development set are 0.824 for arousal, and 0.718 for valence; and on the test set are 0.683 for arousal and 0.642 for valence.","PeriodicalId":432793,"journal":{"name":"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge","volume":"236 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":"{\"title\":\"Exploring Multimodal Visual Features for Continuous Affect Recognition\",\"authors\":\"Bo Sun, Siming Cao, Liandong Li, Jun He, Lejun Yu\",\"doi\":\"10.1145/2988257.2988270\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents our work in the Emotion Sub-Challenge of the 6th Audio/Visual Emotion Challenge and Workshop (AVEC 2016), whose goal is to explore utilizing audio, visual and physiological signals to continuously predict the value of the emotion dimensions (arousal and valence). As visual features are very important in emotion recognition, we try a variety of handcrafted and deep visual features. For each video clip, besides the baseline features, we extract multi-scale Dense SIFT features (MSDF), and some types of Convolutional neural networks (CNNs) features to recognize the expression phases of the current frame. We train linear Support Vector Regression (SVR) for every kind of features on the RECOLA dataset. Multimodal fusion of these modalities is then performed with a multiple linear regression model. The final Concordance Correlation Coefficient (CCC) we gained on the development set are 0.824 for arousal, and 0.718 for valence; and on the test set are 0.683 for arousal and 0.642 for valence.\",\"PeriodicalId\":432793,\"journal\":{\"name\":\"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge\",\"volume\":\"236 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-10-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"28\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2988257.2988270\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2988257.2988270","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 28

摘要

本文介绍了我们在第六届视听情感挑战与研讨会(AVEC 2016)的情感子挑战中的工作，其目标是探索利用音频，视觉和生理信号来持续预测情感维度(唤醒和效价)的值。由于视觉特征在情感识别中非常重要，我们尝试了各种手工制作的深度视觉特征。对于每个视频片段，除了提取基线特征外，我们还提取了多尺度密集SIFT特征(MSDF)和一些类型的卷积神经网络(cnn)特征来识别当前帧的表达阶段。我们对RECOLA数据集上的每一类特征训练线性支持向量回归(SVR)。然后用多元线性回归模型对这些模态进行多模态融合。发展集的最终一致性相关系数(CCC)为唤醒集0.824，效价集0.718;在测试集上，唤醒是0.683效价是0.642。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Exploring Multimodal Visual Features for Continuous Affect Recognition

This paper presents our work in the Emotion Sub-Challenge of the 6th Audio/Visual Emotion Challenge and Workshop (AVEC 2016), whose goal is to explore utilizing audio, visual and physiological signals to continuously predict the value of the emotion dimensions (arousal and valence). As visual features are very important in emotion recognition, we try a variety of handcrafted and deep visual features. For each video clip, besides the baseline features, we extract multi-scale Dense SIFT features (MSDF), and some types of Convolutional neural networks (CNNs) features to recognize the expression phases of the current frame. We train linear Support Vector Regression (SVR) for every kind of features on the RECOLA dataset. Multimodal fusion of these modalities is then performed with a multiple linear regression model. The final Concordance Correlation Coefficient (CCC) we gained on the development set are 0.824 for arousal, and 0.718 for valence; and on the test set are 0.683 for arousal and 0.642 for valence.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge

自引率

0.00%

发文量