Title: Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks
Authors: Jian Huang, Ya Li, J. Tao, Zheng Lian, Mingyue Niu, Minghao Yang
DOI: https://doi.org/10.1145/3266302.3266304
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: This paper presents our efforts for the Cross-cultural Emotion Sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2018, whose goal is to predict the level of three emotional dimensions time-continuously in a cross-cultural setup. We extract emotional features from the audio, visual and textual modalities. The state-of-the-art regressor for continuous emotion recognition, the long short-term memory recurrent neural network (LSTM-RNN), is utilized. We augment the training data by replacing the original training samples with shorter overlapping samples extracted from them, which multiplies the number of training samples and is also beneficial for training the temporal emotion model with the LSTM-RNN. In addition, two strategies are explored to reduce the interlocutor influence and improve performance. We also compare the performance of feature-level fusion and decision-level fusion. The experimental results show the efficiency of the proposed method, and competitive results are obtained.

Title: Deep Learning for Continuous Multiple Time Series Annotations
Authors: Jian Huang, Ya Li, J. Tao, Zheng Lian, Mingyue Niu, Minghao Yang
DOI: https://doi.org/10.1145/3266302.3266305
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: Learning from multiple annotations is an increasingly important research topic. Compared with conventional classification or regression problems, it faces more challenges, because time-continuous annotations introduce label noise and temporal lags in continuous emotion recognition. In this paper, we address the problem with deep learning for continuous multiple time series annotations. We attach a novel crowd layer to the output layer of a basic continuous emotion recognition system, which learns directly from the noisy labels of multiple annotators in an end-to-end manner. The inputs of the system are multimodal features and the targets are the multiple annotations, with the intention of learning an annotator-specific mapping. Our proposed method treats the ground truth as a latent variable and models each annotation as a linear mapping of that ground truth. The experimental results show that our system achieves superior performance and captures the reliabilities and biases of different annotators.

Title: AVEC 2018 Workshop and Challenge: Bipolar Disorder and Cross-Cultural Affect Recognition
Authors: F. Ringeval, Björn Schuller, M. Valstar, R. Cowie, Heysem Kaya, Maximilian Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, Adrien Michaud, E. Çiftçi, Hüseyin Güleç, A. A. Salah, M. Pantic
DOI: https://doi.org/10.1145/3266302.3266316
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: The Audio/Visual Emotion Challenge and Workshop (AVEC 2018), "Bipolar Disorder and Cross-Cultural Affect Recognition", is the eighth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: bipolar disorder classification, cross-cultural dimensional emotion recognition, and emotional label generation from individual ratings, respectively.

{"title":"Session details: Deep Learning for Affective Computing","authors":"F. Ringeval","doi":"10.1145/3286914","DOIUrl":"https://doi.org/10.1145/3286914","url":null,"abstract":"","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"145 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124706105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Better Gold Standard: Denoising and Modelling Continuous Emotion Annotations Based on Feature Agglomeration and Outlier Regularisation","authors":"Chen Wang, Phil Lopes, T. Pun, G. Chanel","doi":"10.1145/3266302.3266307","DOIUrl":"https://doi.org/10.1145/3266302.3266307","url":null,"abstract":"Emotions are often perceived by humans through a series of multimodal cues, such as verbal expressions, facial expressions and gestures. In order to recognise emotions automatically, reliable emotional labels are required to learn a mapping from human expressions to corresponding emotions. Dimensional emotion models have become popular and have been widely applied for annotating emotions continuously in the time domain. However, the statistical relationship between emotional dimensions is rarely studied. This paper provides a solution to automatic emotion recognition for the Audio/Visual Emotion Challenge (AVEC) 2018. The objective is to find a robust way to detect emotions using more reliable emotion annotations in the valence and arousal dimensions. The two main contributions of this paper are: 1) the proposal of a new approach capable of generating more dependable emotional ratings for both arousal and valence from multiple annotators by extracting consistent annotation features; 2) the exploration of the valence and arousal distribution using outlier detection methods, which shows a specific oblique elliptic shape. With the learned distribution, we are able to detect the prediction outliers based on their local density deviations and correct them towards the learned distribution. The proposed method performance is evaluated on the RECOLA database containing audio, video and physiological recordings. Our results show that a moving average filter is sufficient to remove the incidental errors in annotations. The unsupervised dimensionality reduction approaches could be used to determine a gold standard annotations from multiple annotations. Compared with the baseline model of AVEC 2018, our approach improved the arousal and valence prediction of concordance correlation coefficient significantly to respectively 0.821 and 0.589.","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130543526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Speech-based Continuous Emotion Prediction by Learning Perception Responses related to Salient Events: A Study based on Vocal Affect Bursts and Cross-Cultural Affect in AVEC 2018
Authors: Kalani Wataraka Gamage, T. Dang, V. Sethu, J. Epps, E. Ambikairajah
DOI: https://doi.org/10.1145/3266302.3266314
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: This paper presents a novel framework for speech-based continuous emotion prediction. The proposed model characterises perceived emotion as time-invariant responses to salient events: arousal and valence variation over time is modelled as the output of a parallel array of time-invariant filters, where each filter represents a salient event and its impulse response represents the learned emotion perception response. The model is evaluated by considering vocal affect bursts/non-verbal vocal gestures as salient event candidates. It is validated on the development set of the AVEC 2018 challenge and achieves the highest valence prediction accuracy among single-modality methods based on speech or speech transcripts. We also tested the model in the cross-cultural setting provided by the AVEC 2018 challenge test set, where it performs reasonably well on an unseen culture and outperforms the speech-based baselines. We further explore the inclusion of interlocutor-related cues in the proposed model and decision-level fusion with existing features. Since the proposed model was evaluated solely on laughter and slight-laughter affect bursts, which were nominated as salient by the model's saliency constraints, the results highlight the significance of these gestures in human emotion expression and perception.

Title: Bipolar Disorder Recognition with Histogram Features of Arousal and Body Gestures
Authors: Le Yang, Yan Li, Haifeng Chen, D. Jiang, Meshia Cédric Oveneke, H. Sahli
DOI: https://doi.org/10.1145/3266302.3266308
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: This paper targets the Bipolar Disorder Challenge (BDC) task of the Audio/Visual Emotion Challenge (AVEC) 2018. First, two novel features are proposed: 1) a histogram-based arousal feature, in which continuous arousal values are estimated from audio cues by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) model; 2) a Histogram of Displacement (HDR) based upper-body posture feature, which characterises the displacement and velocity of key body points in the video segment. In addition, we propose a multi-stream bipolar disorder classification framework with Deep Neural Networks (DNNs) and a Random Forest, and adopt an ensemble learning strategy to alleviate the possible over-fitting problem caused by the limited training data. Experimental results show that the proposed arousal feature and upper-body posture feature are discriminative for different bipolar episodes, and that the proposed framework achieves promising classification results on the development set, with an unweighted average recall (UAR) of 0.714, higher than the baseline result of 0.635. On the test set evaluation, our system obtains the same UAR (0.574) as the challenge baseline.

{"title":"Learning an Arousal-Valence Speech Front-End Network using Media Data In-the-Wild for Emotion Recognition","authors":"Chih-Chuan Lu, Jeng-Lin Li, Chi-Chun Lee","doi":"10.1145/3266302.3266306","DOIUrl":"https://doi.org/10.1145/3266302.3266306","url":null,"abstract":"Recent progress in speech emotion recognition (SER) technology has benefited from the use of deep learning techniques. However, expensive human annotation and difficulty in emotion database collection make it challenging for rapid deployment of SER across diverse application domains. An initialization - fine-tuning strategy help mitigate these technical challenges. In this work, we propose an initialization network that gears toward SER applications by learning the speech front-end network on a large media data collected in-the-wild jointly with proxy arousal-valence labels that are multimodally derived from audio and text information, termed as the Arousal-Valence Speech Front-End Network (AV-SpNET). The AV-SpNET can then be easily stacked simply with the supervised layers for the target emotion corpus of interest. We evaluate our proposed AV-SpNET on tasks of SER for two separate emotion corpora, the USC IEMOCAP and the NNIME database. The AV-SpNET outperforms other initialization techniques and reach the best overall performances requiring only 75% of the in-domain annotated data. We also observe that generally, by using the AV-SpNET as front-end network, it requires as little as 50% of the fine-tuned data to surpass method based on randomly-initialized network with fine-tuning on the complete training set.","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127226415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","authors":"","doi":"10.1145/3266302","DOIUrl":"https://doi.org/10.1145/3266302","url":null,"abstract":"","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114452106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpersonal Behavior Modeling for Personality, Affect, and Mental States Recognition and Analysis","authors":"Chi-Chun Lee","doi":"10.1145/3266302.3266303","DOIUrl":"https://doi.org/10.1145/3266302.3266303","url":null,"abstract":"Imagine humans as complex dynamical systems: systems that are characterized by multiple interacting layers of hidden states (e.g., internal processes involving functions of cognition, perception, production, emotion, and social interaction) producing measurable multimodal signals (e.g., body gestures, facial expressions, physiology, and speech). This abstraction of humans with a signals and systems framework naturally brings a synergy between communities of engineering and behavioral sciences. Various research fields have emerged from such an interdisciplinary human-centered effort, e.g., behavioral signal processing [7], social signal processing [10], and affective computing [8], where technological advancements has continuously been made in order to robustly assess and infer individual speaker's states and traits. The complexities in modeling human behavior are centered on the issue of heterogeneity of human behavior. Sources of variability in human behaviors originate from the differences in mechanisms of information encoding (behavior production) and decoding (behavior perception). Furthermore, a key additional layer of complexity exists because human behaviors occur largely during interactions with the environment and agents therein. This interplay, which causes a coupling effect between humans' behaviors, is the essence of interpersonal dynamics. This unique behavior dynamic has been at core not only in human communication studies [2], but further is crucial in automatic characterizing the speaker's social-affective behavior phenomenon (e.g., emotion recognition [4, 5] and personality trait identification [3, 9]) and in understanding interactions of those typical, distressed to disordered manifestations [1, 6].","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125709246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}