Title: Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks
Authors: Jian Huang, Ya Li, J. Tao, Zheng Lian, Mingyue Niu, Minghao Yang
DOI: https://doi.org/10.1145/3266302.3266304
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: This paper presents our efforts for the Cross-cultural Emotion Sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2018, whose goal is to predict the level of three emotional dimensions time-continuously in a cross-cultural setup. We extract emotional features from the audio, visual and textual modalities. The state-of-the-art regressor for continuous emotion recognition, the long short-term memory recurrent neural network (LSTM-RNN), is utilized. We augment the training data by replacing the original training samples with shorter overlapping samples extracted from them, which multiplies the number of training samples and is also beneficial for training the temporal emotion model with the LSTM-RNN. In addition, two strategies are explored to reduce the interlocutor influence and improve performance. We also compare the performance of feature-level fusion and decision-level fusion. The experimental results show the efficiency of the proposed method, and competitive results are obtained.

Title: Deep Learning for Continuous Multiple Time Series Annotations
Authors: Jian Huang, Ya Li, J. Tao, Zheng Lian, Mingyue Niu, Minghao Yang
DOI: https://doi.org/10.1145/3266302.3266305
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: Learning from multiple annotations is an increasingly important research topic. Compared with conventional classification or regression problems, it faces more challenges, because time-continuous annotations introduce label noise and temporal lags in continuous emotion recognition. In this paper, we address the problem with deep learning for continuous multiple time series annotations. We attach a novel crowd layer to the output layer of a basic continuous emotion recognition system, which learns directly from the noisy labels of multiple annotators in an end-to-end manner. The inputs of the system are multimodal features and the targets are the multiple annotations, with the intention of learning an annotator-specific mapping. Our proposed method treats the ground truth as a latent variable and models each annotation as a linear mapping of that ground truth. The experimental results show that our system achieves superior performance and captures the reliabilities and biases of different annotators.

Title: AVEC 2018 Workshop and Challenge: Bipolar Disorder and Cross-Cultural Affect Recognition
Authors: F. Ringeval, Björn Schuller, M. Valstar, R. Cowie, Heysem Kaya, Maximilian Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, Adrien Michaud, E. Çiftçi, Hüseyin Güleç, A. A. Salah, M. Pantic
DOI: https://doi.org/10.1145/3266302.3266316
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: The Audio/Visual Emotion Challenge and Workshop (AVEC 2018), "Bipolar Disorder and Cross-Cultural Affect Recognition", is the eighth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: bipolar disorder classification, cross-cultural dimensional emotion recognition, and emotional label generation from individual ratings, respectively.

{"title":"Session details: Deep Learning for Affective Computing","authors":"F. Ringeval","doi":"10.1145/3286914","DOIUrl":"https://doi.org/10.1145/3286914","url":null,"abstract":"","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"145 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124706105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Better Gold Standard: Denoising and Modelling Continuous Emotion Annotations Based on Feature Agglomeration and Outlier Regularisation","authors":"Chen Wang, Phil Lopes, T. Pun, G. Chanel","doi":"10.1145/3266302.3266307","DOIUrl":"https://doi.org/10.1145/3266302.3266307","url":null,"abstract":"Emotions are often perceived by humans through a series of multimodal cues, such as verbal expressions, facial expressions and gestures. In order to recognise emotions automatically, reliable emotional labels are required to learn a mapping from human expressions to corresponding emotions. Dimensional emotion models have become popular and have been widely applied for annotating emotions continuously in the time domain. However, the statistical relationship between emotional dimensions is rarely studied. This paper provides a solution to automatic emotion recognition for the Audio/Visual Emotion Challenge (AVEC) 2018. The objective is to find a robust way to detect emotions using more reliable emotion annotations in the valence and arousal dimensions. The two main contributions of this paper are: 1) the proposal of a new approach capable of generating more dependable emotional ratings for both arousal and valence from multiple annotators by extracting consistent annotation features; 2) the exploration of the valence and arousal distribution using outlier detection methods, which shows a specific oblique elliptic shape. With the learned distribution, we are able to detect the prediction outliers based on their local density deviations and correct them towards the learned distribution. The proposed method performance is evaluated on the RECOLA database containing audio, video and physiological recordings. Our results show that a moving average filter is sufficient to remove the incidental errors in annotations. The unsupervised dimensionality reduction approaches could be used to determine a gold standard annotations from multiple annotations. Compared with the baseline model of AVEC 2018, our approach improved the arousal and valence prediction of concordance correlation coefficient significantly to respectively 0.821 and 0.589.","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130543526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Speech-based Continuous Emotion Prediction by Learning Perception Responses related to Salient Events: A Study based on Vocal Affect Bursts and Cross-Cultural Affect in AVEC 2018
Authors: Kalani Wataraka Gamage, T. Dang, V. Sethu, J. Epps, E. Ambikairajah
DOI: https://doi.org/10.1145/3266302.3266314
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: This paper presents a novel framework for speech-based continuous emotion prediction. The proposed model characterises perceived emotion as time-invariant responses to salient events: arousal and valence variation over time is modelled as the output of a parallel array of time-invariant filters, where each filter represents a salient event and its impulse response represents the learned emotion perception response. The model is evaluated by considering vocal affect bursts/non-verbal vocal gestures as salient event candidates. It is validated on the development set of the AVEC 2018 challenge and achieves the highest valence prediction accuracy among single-modality methods based on speech or speech transcripts. We also tested the model in the cross-cultural setting provided by the AVEC 2018 challenge test set, where it performs reasonably well on an unseen culture and outperforms the speech-based baselines. We further explore the inclusion of interlocutor-related cues in the proposed model and decision-level fusion with existing features. Since the proposed model was evaluated solely on laughter and slight-laughter affect bursts, which were nominated as salient by the model's saliency constraints, the results highlight the significance of these gestures in human emotion expression and perception.

Title: Bipolar Disorder Recognition with Histogram Features of Arousal and Body Gestures
Authors: Le Yang, Yan Li, Haifeng Chen, D. Jiang, Meshia Cédric Oveneke, H. Sahli
DOI: https://doi.org/10.1145/3266302.3266308
Published: 15 October 2018, Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop (AVEC 2018)
Abstract: This paper targets the Bipolar Disorder Challenge (BDC) task of the Audio/Visual Emotion Challenge (AVEC) 2018. First, two novel features are proposed: 1) a histogram-based arousal feature, in which continuous arousal values are estimated from audio cues by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) model; 2) a Histogram of Displacement (HDR) based upper-body posture feature, which characterises the displacement and velocity of key body points in the video segment. In addition, we propose a multi-stream bipolar disorder classification framework with Deep Neural Networks (DNNs) and a Random Forest, and adopt an ensemble learning strategy to alleviate the possible over-fitting problem caused by the limited training data. Experimental results show that the proposed arousal feature and upper-body posture feature are discriminative for different bipolar episodes, and that the proposed framework achieves promising classification results on the development set, with an unweighted average recall (UAR) of 0.714, higher than the baseline result of 0.635. On the test set evaluation, our system obtains the same UAR (0.574) as the challenge baseline.

{"title":"Learning an Arousal-Valence Speech Front-End Network using Media Data In-the-Wild for Emotion Recognition","authors":"Chih-Chuan Lu, Jeng-Lin Li, Chi-Chun Lee","doi":"10.1145/3266302.3266306","DOIUrl":"https://doi.org/10.1145/3266302.3266306","url":null,"abstract":"Recent progress in speech emotion recognition (SER) technology has benefited from the use of deep learning techniques. However, expensive human annotation and difficulty in emotion database collection make it challenging for rapid deployment of SER across diverse application domains. An initialization - fine-tuning strategy help mitigate these technical challenges. In this work, we propose an initialization network that gears toward SER applications by learning the speech front-end network on a large media data collected in-the-wild jointly with proxy arousal-valence labels that are multimodally derived from audio and text information, termed as the Arousal-Valence Speech Front-End Network (AV-SpNET). The AV-SpNET can then be easily stacked simply with the supervised layers for the target emotion corpus of interest. We evaluate our proposed AV-SpNET on tasks of SER for two separate emotion corpora, the USC IEMOCAP and the NNIME database. The AV-SpNET outperforms other initialization techniques and reach the best overall performances requiring only 75% of the in-domain annotated data. We also observe that generally, by using the AV-SpNET as front-end network, it requires as little as 50% of the fine-tuned data to surpass method based on randomly-initialized network with fine-tuning on the complete training set.","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127226415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","authors":"","doi":"10.1145/3266302","DOIUrl":"https://doi.org/10.1145/3266302","url":null,"abstract":"","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114452106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpersonal Behavior Modeling for Personality, Affect, and Mental States Recognition and Analysis","authors":"Chi-Chun Lee","doi":"10.1145/3266302.3266303","DOIUrl":"https://doi.org/10.1145/3266302.3266303","url":null,"abstract":"Imagine humans as complex dynamical systems: systems that are characterized by multiple interacting layers of hidden states (e.g., internal processes involving functions of cognition, perception, production, emotion, and social interaction) producing measurable multimodal signals (e.g., body gestures, facial expressions, physiology, and speech). This abstraction of humans with a signals and systems framework naturally brings a synergy between communities of engineering and behavioral sciences. Various research fields have emerged from such an interdisciplinary human-centered effort, e.g., behavioral signal processing [7], social signal processing [10], and affective computing [8], where technological advancements has continuously been made in order to robustly assess and infer individual speaker's states and traits. The complexities in modeling human behavior are centered on the issue of heterogeneity of human behavior. Sources of variability in human behaviors originate from the differences in mechanisms of information encoding (behavior production) and decoding (behavior perception). Furthermore, a key additional layer of complexity exists because human behaviors occur largely during interactions with the environment and agents therein. This interplay, which causes a coupling effect between humans' behaviors, is the essence of interpersonal dynamics. This unique behavior dynamic has been at core not only in human communication studies [2], but further is crucial in automatic characterizing the speaker's social-affective behavior phenomenon (e.g., emotion recognition [4, 5] and personality trait identification [3, 9]) and in understanding interactions of those typical, distressed to disordered manifestations [1, 6].","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125709246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}