Title: Multimodal Emotion Recognition for AVEC 2016 Challenge
Authors: Filip Povolný, P. Matejka, Michal Hradiš, A. Popková, Lubomír Otrusina, P. Smrz, Ian D. Wood, Cécile Robin, L. Lamel
DOI: 10.1145/2988257.2988268
Published: 2016-10-16
Abstract: This paper describes a system for emotion recognition and its application to the dataset from the AV+EC 2016 Emotion Recognition Challenge. The system was developed for and submitted to the AV+EC 2016 evaluation, making use of all three modalities (audio, video, and physiological data). Our work primarily focused on features derived from audio. The original audio features were complemented with bottleneck features and with text-based emotion recognition, which is based on transcribing the audio with an automatic speech recognition system and applying resources such as word embedding models and sentiment lexicons. Our multimodal fusion reached CCC = 0.855 on the development set for arousal and 0.713 for valence. On the test set, CCC is 0.719 for arousal and 0.596 for valence.

Title: Decision Tree Based Depression Classification from Audio Video and Language Information
Authors: Le Yang, D. Jiang, Lang He, Ercheng Pei, Meshia Cédric Oveneke, H. Sahli
DOI: 10.1145/2988257.2988269
Published: 2016-10-16
Abstract: In order to improve the recognition accuracy of the Depression Classification Sub-Challenge (DCC) of AVEC 2016, in this paper we propose a decision tree for depression classification. The decision tree is constructed according to the distribution of the multimodal predictions of PHQ-8 scores and participants' characteristics (PTSD/depression diagnosis, sleep status, feeling, and personality) obtained via analysis of the participants' transcript files. The proposed gender-specific decision tree provides a way of fusing the upper-level language information with the results obtained using low-level audio and visual features. Experiments are carried out on the Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ) database; results show that the proposed depression classification schemes obtain very promising results on the development set, with F1 scores reaching 0.857 for the depressed class and 0.964 for the not-depressed class. Despite the over-fitting problem in training the models that predict the PHQ-8 scores, the classification schemes still obtain satisfying performance on the test set: the F1 score reaches 0.571 for the depressed class and 0.877 for the not-depressed class, with an average of 0.724, which is higher than the baseline result of 0.700.

Title: Depression Assessment by Fusing High and Low Level Features from Audio, Video, and Text
Authors: A. Pampouchidou, Olympia Simantiraki, Amir Fazlollahi, M. Pediaditis, D. Manousos, A. Roniotis, G. Giannakakis, F. Mériaudeau, P. Simos, K. Marias, Fan Yang, M. Tsiknakis
DOI: 10.1145/2988257.2988266
Published: 2016-10-16
Abstract: Depression is a major cause of disability worldwide. The present paper reports on the results of our participation in the depression sub-challenge of the sixth Audio/Visual Emotion Challenge (AVEC 2016), which was designed to compare feature modalities (audio, visual, interview transcript-based) in gender-based and gender-independent modes using a variety of classification algorithms. In our approach, both high- and low-level features were assessed in each modality. Audio features were extracted from the low-level descriptors provided by the challenge organizers. Several visual features were extracted and assessed, including dynamic characteristics of facial elements (using Landmark Motion History Histograms and Landmark Motion Magnitude), global head motion, and eye blinks. These features were combined with statistically derived features from pre-extracted features (emotions, action units, gaze, and pose). Both speech rate and word-level semantic content were also evaluated. Classification results are reported using four different classification schemes: i) gender-based models for each individual modality, ii) the feature fusion model, iii) the decision fusion model, and iv) the posterior probability classification model. Proposed approaches outperforming the reference classification accuracy include the one utilizing statistical descriptors of low-level audio features. This approach achieved F1 scores of 0.59 for identifying depressed and 0.87 for identifying not-depressed individuals on the development set, and 0.52/0.81, respectively, on the test set.

Title: Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction
Authors: K. Brady, Youngjune Gwon, Pooya Khorrami, Elizabeth Godoy, W. Campbell, Charlie K. Dagli, Thomas S. Huang
DOI: 10.1145/2988257.2988264
Published: 2016-10-16
Abstract: The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications, including biomedical diagnostics, multimedia retrieval, and human-computer interfaces. The Audio Video Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the arousal and valence states of emotion as a function of time. It presents the opportunity for investigating multimodal solutions that include audio, video, and physiological sensor signals. This paper provides an overview of our AVEC Emotion Challenge system, which uses multi-feature learning and fusion across all available modalities. It includes a number of technical contributions, including the development of novel high- and low-level features for modeling emotion in the audio, video, and physiological channels. Low-level features include modeling arousal in audio with minimal prosodic-based descriptors. High-level features are derived from supervised and unsupervised machine learning approaches based on sparse coding and deep learning. Finally, a state-space estimation approach is applied for score fusion that demonstrates the importance of exploiting the time-series nature of the arousal and valence states. The resulting system outperforms the baseline systems [10] on the test evaluation set, with an achieved Concordance Correlation Coefficient (CCC) of 0.770 vs 0.702 (baseline) for arousal and 0.687 vs 0.638 for valence. Future work will focus on exploiting the time-varying nature of individual channels in the multi-modal framework.

Title: High-Level Geometry-based Features of Video Modality for Emotion Prediction
Authors: Raphaël Weber, Vincent Barrielle, Catherine Soladié, R. Séguier
DOI: 10.1145/2988257.2988262
Published: 2016-10-16
Abstract: The automatic analysis of emotion remains a challenging task in unconstrained experimental conditions. In this paper, we present our contribution to the 6th Audio/Visual Emotion Challenge (AVEC 2016), which aims at predicting the continuous emotional dimensions of arousal and valence. First, we propose to improve the performance of the multimodal prediction with low-level features by adding high-level geometry-based features, namely head pose and expression signature. The head pose is estimated by fitting a reference 3D mesh to the 2D facial landmarks. The expression signature is the projection of the facial landmarks in an unsupervised person-specific model. Second, we propose to fuse the unimodal predictions trained on each training subject before performing the multimodal fusion. The results show that our high-level features improve the performance of the multimodal prediction of arousal and that the subjects fusion works well in unimodal prediction but generalizes poorly in multimodal prediction, particularly on valence.

Title: Online Affect Tracking with Multimodal Kalman Filters
Authors: Krishna Somandepalli, Rahul Gupta, Md. Nasir, Brandon M. Booth, Sungbok Lee, Shrikanth S. Narayanan
DOI: 10.1145/2988257.2988259
Published: 2016-10-16
Abstract: Arousal and valence have been widely used to represent emotions dimensionally and measure them continuously in time. In this paper, we introduce a computational framework for tracking these affective dimensions from multimodal data as an entry to the Multimodal Affect Recognition Sub-Challenge of the 2016 Audio/Visual Emotion Challenge and Workshop (AVEC2016). We propose a linear dynamical system approach with a late fusion method that accounts for the dynamics of the affective state evolution (i.e., arousal or valence). To this end, single-modality predictions are modeled as observations in a Kalman filter formulation in order to continuously track each affective dimension. Leveraging the inter-correlations between arousal and valence, we use the predicted arousal as an additional feature to improve valence predictions. Furthermore, we propose a conditional framework to select Kalman filters of different modalities while tracking. This framework employs voicing probability and facial posture cues to detect the absence or presence of each input modality. Our multimodal fusion results on the development and the test set provide a statistically significant improvement over the baseline system from AVEC2016. The proposed approach can be potentially extended to other multimodal tasks with inter-correlated behavioral dimensions.

Title: AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge
Authors: M. Valstar, J. Gratch, Björn Schuller, F. Ringeval, D. Lalanne, M. Torres, Stefan Scherer, Giota Stratou, R. Cowie, M. Pantic
DOI: 10.1145/2988257.2988258
Published: 2016-05-05
Abstract: The Audio/Visual Emotion Challenge and Workshop (AVEC 2016) "Depression, Mood and Emotion" will be the sixth competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological depression and emotion analysis, with all participants competing under strictly the same conditions. The goal of the Challenge is to provide a common benchmark test set for multi-modal information processing and to bring together the depression and emotion recognition communities, as well as the audio, video and physiological processing communities, to compare the relative merits of the various approaches to depression and emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.

{"title":"Session details: Keynote Address","authors":"M. Valstar","doi":"10.1145/3255910","DOIUrl":"https://doi.org/10.1145/3255910","url":null,"abstract":"","PeriodicalId":432793,"journal":{"name":"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124031827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge
Authors: M. Valstar, J. Gratch, Björn Schuller, F. Ringeval, R. Cowie, M. Pantic
DOI: 10.1145/2988257
Published: 2014-11-07
Abstract: It is our great pleasure to welcome you to the 5th Audio-Visual Emotion recognition Challenge (AVEC 2015), held in conjunction with ACM Multimedia 2015. This year's challenge and associated workshop continue to push the boundaries of audio-visual emotion recognition. The first AVEC challenge posed the problem of detecting discrete emotion classes on an extremely large set of natural behaviour data. The second AVEC extended this problem to the prediction of continuous valued dimensional affect on the same set of challenging data. In its third edition, we enlarged the problem even further to include the prediction of self-reported severity of depression. The fourth edition of AVEC focused on the study of depression and affect by narrowing down the number of tasks to be used, and enriching the annotation. Finally, this year we've focused the study of affect by including physiology, along with audio-visual data, in the dataset, making it the very first emotion recognition challenge that bridges across audio, video and physiological data.

The mission of the AVEC challenge and workshop series is to provide a common benchmark test set for individual multimodal information processing and to bring together the audio, video and -- for the first time ever -- physiological emotion recognition communities, to compare the relative merits of the three approaches to emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. A second motivation is the need to advance emotion recognition systems to be able to deal with naturalistic behaviour in large volumes of un-segmented, non-prototypical and non-preselected data. As you will see, these goals have been reached with the selection of this year's data and the challenge contributions.

The call for participation attracted 15 submissions from Asia, Europe, Oceania and North America. The programme committee accepted 9 papers in addition to the baseline paper for oral presentation. For the challenge, no less than 48 results submissions were made by 13 teams! We hope that these proceedings will serve as a valuable reference for researchers and developers in the area of audio-visual-physiological emotion recognition and analysis.

We also encourage attendees to attend the keynote presentation. This valuable and insightful talk can and will guide us to a better understanding of the state of the field, and future direction:

AVEC'15 Keynote Talk -- From Facial Expression Analysis to Multimodal Mood Analysis, Pr. Roland Goecke (University of Canberra, Australia)