Affect Recognition using Key Frame Selection based on Minimum Sparse Reconstruction

M. Kayaoglu, Ç. Erdem
{"title":"Affect Recognition using Key Frame Selection based on Minimum Sparse Reconstruction","authors":"M. Kayaoglu, Ç. Erdem","doi":"10.1145/2818346.2830594","DOIUrl":null,"url":null,"abstract":"In this paper, we present the methods used for Bahcesehir University team's submissions to the 2015 Emotion Recognition in the Wild Challenge. The challenge consists of categorical emotion recognition in short video clips extracted from movies based on emotional keywords in the subtitles. The video clips mostly contain expressive faces (single or multiple) and also audio which contains the speech of the person in the clip as well as other human voices or background sounds/music. We use an audio-visual method based on video summarization by key frame selection. The key frame selection uses a minimum sparse reconstruction approach with the goal of representing the original video in the best possible way. We extract the LPQ features of the key frames and average them to determine a single feature vector that will represent the video component of the clip. In order to represent the temporal variations of the facial expression, we also use the LBP-TOP features extracted from the whole video. The audio features are extracted using OpenSMILE or RASTA-PLP methods. Video and audio features are classified using SVM classifiers and fused at the score level. We tested eight different combinations of audio and visual features on the AFEW 5.0 (Acted Facial Expressions in the Wild) database provided by the challenge organizers. The best visual and audio-visual accuracies obtained on the test set are 45.1% and 49.9% respectively, whereas the video-based baseline for the challenge is given as 39.3%.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"37 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2818346.2830594","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 17

Abstract

In this paper, we present the methods used for Bahcesehir University team's submissions to the 2015 Emotion Recognition in the Wild Challenge. The challenge consists of categorical emotion recognition in short video clips extracted from movies based on emotional keywords in the subtitles. The video clips mostly contain expressive faces (single or multiple) and also audio which contains the speech of the person in the clip as well as other human voices or background sounds/music. We use an audio-visual method based on video summarization by key frame selection. The key frame selection uses a minimum sparse reconstruction approach with the goal of representing the original video in the best possible way. We extract the LPQ features of the key frames and average them to determine a single feature vector that will represent the video component of the clip. In order to represent the temporal variations of the facial expression, we also use the LBP-TOP features extracted from the whole video. The audio features are extracted using OpenSMILE or RASTA-PLP methods. Video and audio features are classified using SVM classifiers and fused at the score level. We tested eight different combinations of audio and visual features on the AFEW 5.0 (Acted Facial Expressions in the Wild) database provided by the challenge organizers. The best visual and audio-visual accuracies obtained on the test set are 45.1% and 49.9% respectively, whereas the video-based baseline for the challenge is given as 39.3%.
基于最小稀疏重建的关键帧选择影响识别
在本文中,我们介绍了Bahcesehir大学团队向2015年野生挑战中的情感识别提交的方法。挑战包括基于字幕中的情感关键词对从电影中提取的短视频片段进行分类情感识别。视频片段大多包含表情丰富的面孔(单个或多个),还有音频,其中包含剪辑中人物的讲话以及其他人声或背景声音/音乐。我们采用了一种基于关键帧选择的视频摘要的视听方法。关键帧选择使用最小稀疏重建方法,目标是以最好的方式表示原始视频。我们提取关键帧的LPQ特征,并对它们进行平均,以确定一个单一的特征向量,该特征向量将代表剪辑的视频分量。为了表示面部表情的时间变化,我们还使用了从整个视频中提取的LBP-TOP特征。使用OpenSMILE或RASTA-PLP方法提取音频特征。使用SVM分类器对视频和音频特征进行分类,并在分数水平上进行融合。我们在挑战组织者提供的AFEW 5.0(野外面部表情)数据库上测试了八种不同的音频和视觉特征组合。在测试集上获得的最佳视觉和视听准确度分别为45.1%和49.9%,而基于视频的挑战基线为39.3%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信