Multi-modal Dimensional Emotion Recognition using Recurrent Neural Networks
Shizhe Chen, Qin Jin
Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015-10-26
DOI: 10.1145/2808196.2811638 (https://doi.org/10.1145/2808196.2811638)
Citations: 92
Abstract
Emotion recognition is an active research area with wide applications and significant challenges. This paper presents our system for the Audio/Visual Emotion Challenge (AVEC2015), whose goal is to use audio, visual, and physiological signals to continuously predict the values of the emotion dimensions (arousal and valence). Our system applies Recurrent Neural Networks (RNNs) to model temporal information. We explore several aspects to improve prediction performance: the dominant modalities for arousal and valence prediction, the duration of features, novel loss functions, the directions of Long Short-Term Memory (LSTM) networks, multi-task learning, and different structures for early feature fusion and late fusion. The best settings are chosen according to performance on the development set. Competitive experimental results relative to the baseline demonstrate the effectiveness of the proposed methods.
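The abstract contrasts early feature fusion with late fusion without giving details. The following is only a minimal sketch of the general distinction, not the paper's actual architecture: early fusion joins per-frame features from several modalities before a single model sees them, while late fusion combines the prediction sequences produced by separate per-modality models. The function names, the weighted-average combination rule, and the toy values are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def early_fusion(modalities: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Concatenate per-frame feature vectors from each modality (hypothetical helper)."""
    n_frames = len(modalities[0])
    if any(len(m) != n_frames for m in modalities):
        raise ValueError("all modalities must have the same number of frames")
    # For each frame t, join the feature vectors of every modality into one vector.
    return [[v for m in modalities for v in m[t]] for t in range(n_frames)]


def late_fusion(
    predictions: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-modality prediction sequences (assumed combination rule)."""
    n_frames = len(predictions[0])
    if any(len(p) != n_frames for p in predictions):
        raise ValueError("all prediction sequences must have the same length")
    if weights is None:
        weights = [1.0] * len(predictions)  # equal weighting by default
    if len(weights) != len(predictions):
        raise ValueError("one weight per prediction sequence is required")
    total = sum(weights)
    return [
        sum(w * p[t] for w, p in zip(weights, predictions)) / total
        for t in range(n_frames)
    ]


# Toy data: two frames, 2-dim "audio" features and 1-dim "video" features.
audio_features = [[0.1, 0.2], [0.3, 0.4]]
video_features = [[1.0], [2.0]]
fused_features = early_fusion([audio_features, video_features])

# Toy per-modality arousal predictions for the same two frames.
audio_pred = [0.0, 1.0]
video_pred = [1.0, 0.0]
fused_pred = late_fusion([audio_pred, video_pred])
```

In a full system the fused features would feed a single recurrent model, whereas the late-fusion inputs would come from one trained model per modality; the sketch leaves those models out entirely.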