{"title":"Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network and Self-Attention Mechanism","authors":"Licai Sun, Zheng Lian, J. Tao, Bin Liu, Mingyue Niu","doi":"10.1145/3423327.3423672","DOIUrl":"https://doi.org/10.1145/3423327.3423672","url":null,"abstract":"Automatic perception and understanding of human emotion or sentiment has a wide range of applications and has attracted increasing attention nowadays. The Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 provides a testing bed for recognizing human emotion or sentiment from multiple modalities (audio, video, and text) in the wild scenario. In this paper, we present our solutions to the MuSe-Wild sub-challenge of MuSe 2020. The goal of this sub-challenge is to perform continuous emotion (arousal and valence) predictions on a car review database, Muse-CaR. To this end, we first extract both handcrafted features and deep representations from multiple modalities. Then, we utilize the Long Short-Term Memory (LSTM) recurrent neural network as well as the self-attention mechanism to model the complex temporal dependencies in the sequence. The Concordance Correlation Coefficient (CCC) loss is employed to guide the model to learn local variations and the global trend of emotion simultaneously. Finally, two fusion strategies, early fusion and late fusion, are adopted to further boost the model's performance by exploiting complementary information from different modalities. Our proposed method achieves CCC of 0.4726 and 0.5996 for arousal and valence respectively on the test set, which outperforms the baseline system with corresponding CCC of 0.2834 and 0.2431.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124600811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending Multimodal Emotion Recognition with Biological Signals: Presenting a Novel Dataset and Recent Findings","authors":"Alice Baird","doi":"10.1145/3423327.3423512","DOIUrl":"https://doi.org/10.1145/3423327.3423512","url":null,"abstract":"Multimodal fusion has shown great promise in recent literature, particularly for audio dominant tasks. In this talk, we outline a the finding from a recently developed multimodal dataset, and discuss the promise of fusing biological signals with speech for continuous recognition of the emotional dimensions of valence and arousal in the context of public speaking. As well as this, we discuss the advantage of cross-language (German and English) analysis by training language-independent models and testing them on speech from various native and non-native groupings. For the emotion recognition task used as a case study, a Long Short-Term Memory - Recurrent Neural Network (LSTM-RNN) architecture with a self-attention layer is used.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126470465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Social Media Mining","authors":"Y. Kompatsiaris","doi":"10.1145/3423327.3423511","DOIUrl":"https://doi.org/10.1145/3423327.3423511","url":null,"abstract":"Social media have transformed the Web into an interactive sharing platform where users upload data and media, comment on, and share this content within their social circles. The large-scale availability of user-generated content in social media platforms has opened up new possibilities for studying and understanding real-world phenomena, trends and events. The objective of this talk is to provide an overview of social media mining, which offers a unique opportunity to discover, collect, and extract relevant information in order to provide useful insights. It will include key challenges and issues, such as fighting misinformation, data collection, analysis and visualization components, applications, results and demonstrations from multiple areas ranging from news to environmental and security ones.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114424534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vehicle Interiors as Sensate Environments","authors":"M. Würtenberger","doi":"10.1145/3423327.3423509","DOIUrl":"https://doi.org/10.1145/3423327.3423509","url":null,"abstract":"The research field of biologically inspired and cognitive systems is currently gaining increasing interest. However, modern vehicles and their architectures are still dominated by traditional, engineered systems. This talk will give an industrial perspective on potential usage of biologicallyinspired systems and cognitive architectures in future vehicles. A vehicle's interior can be considered a highly interactive sensate environment. With the advent of highly automated driving, even more emphasis will be on this smart space and the corresponding user experience. New interior layouts become possible, with the attention shifting from the driver to the wellbeing and comfort of rider passengers in highly reconfigurable interior layouts. Tactile intelligence in particular will add an exciting new modality and help address challenges of safe human-robot coexistence. By focusing on opportunities for such approaches but also by pointing out challenges with respect to industrial requirements, the goal of this talk is to initiate and stimulate discussions regarding integration of cognitive systems in future vehicle architectures.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125228149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-modal Fusion for Video Sentiment Analysis","authors":"Ruichen Li, Jinming Zhao, Jingwen Hu, Shuai Guo, Qin Jin","doi":"10.1145/3423327.3423671","DOIUrl":"https://doi.org/10.1145/3423327.3423671","url":null,"abstract":"Automatic sentiment analysis can support revealing a subject's emotional state and opinion tendency toward an entity. In this paper, we present our solutions for the MuSe-Wild sub-challenge of Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020. The videos in this challenge are collected from YouTube about emotional car reviews. In the scenarios, the speaker's sentiment can be conveyed in different modalities including acoustic, visual, and textual modalities. Due to the complementarity of different modalities, the fusion of the multiple modalities has a large impact on sentiment analysis. In this paper, we highlight two aspects of our solutions: 1) we explore various low-level and high-level features from different modalities for emotional state recognition, such as expert-defined low-level descriptors (LLD) and deep learned features, etc. 2) we propose several effective multi-modal fusion strategies to make full use of the different modalities. Our solutions achieve the best CCC performance of 0.4346 and 0.4513 on arousal and valence respectively on the challenge testing set, which significantly outperforms the baseline system with corresponding CCC of 0.2843 and 0.2413 on arousal and valence. The experimental results show that our proposed various effective representations of different modalities and fusion strategies have a strong generalization ability and can bring more robust performance.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131881803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personalized Machine Learning for Human-centered Machine Intelligence","authors":"Ognjen Rudovic","doi":"10.1145/3423327.3423510","DOIUrl":"https://doi.org/10.1145/3423327.3423510","url":null,"abstract":"Recent developments in AI and Machine Learning (ML) are revolutionizing traditional technologies for health and education by enabling more intelligent therapeutic and learning tools that can automatically perceive and predict user's behavior (e.g. from videos) or health status from user's past clinical data. To date, most of these tools still rely on traditional 'on-size-fits-all' ML paradigm, rendering generic learning algorithms that, in most cases, are suboptimal on the individual level, mainly because of the large heterogeneity of the target population. Furthermore, such approach may provide misleading outcomes as it fails to account for context in which target behaviors/clinical data are being analyzed. This calls for new human-centered machine intelligence enabled by ML algorithms that are tailored to each individual and context under the study. In this talk, I will present the key ideas and applications of Personalized Machine Learning (PML) framework specifically designed to tackle those challenges. The applications range from personalized forecasting of Alzheimer's related cognitive decline, using Gaussian Process models, to Personalized Deep Neural Networks, designed for classification of facial affect of typical individuals using the notion of meta-learning and reinforcement learning. I will then describe in more detail how this framework can be used to tackle a challenging problem of robot perception of affect and engagement in autism therapy. Lastly, I will discuss the future research on PML and human-centered ML design, outlining challenges and opportunities.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123949369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AAEC: An Adversarial Autoencoder-based Classifier for Audio Emotion Recognition","authors":"Changzeng Fu, Jiaqi Shi, Chaoran Liu, C. Ishi, H. Ishiguro","doi":"10.1145/3423327.3423669","DOIUrl":"https://doi.org/10.1145/3423327.3423669","url":null,"abstract":"In recent years, automatic emotion recognition has attracted the attention of researchers because of its great effects and wide implementations in supporting humans' activities. Given that the data about emotions is difficult to collect and organize into a large database like the dataset of text or images, the true distribution would be difficult to be completely covered by the training set, which affects the model's robustness and generalization in subsequent applications. In this paper, we proposed a model, Adversarial Autoencoder-based Classifier (AAEC), that can not only augment the data within real data distribution but also reasonably extend the boundary of the current data distribution to a possible space. Such an extended space would be better to fit the distribution of training and testing sets. In addition to comparing with baseline models, we modified our proposed model into different configurations and conducted a comprehensive self-comparison with audio modality. The results of our experiment show that our proposed model outperforms the baselines.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117021144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MuSe 2020 Challenge and Workshop: Multimodal Sentiment Analysis, Emotion-target Engagement and Trustworthiness Detection in Real-life Media: Emotional Car Reviews in-the-wild","authors":"Lukas Stappen, Alice Baird, Georgios Rizos, Panagiotis Tzirakis, Xinchen Du, Felix Hafner, Lea Schumann, Adria Mallol-Ragolta, Björn Schuller, I. Lefter, E. Cambria, Y. Kompatsiaris","doi":"10.1145/3423327.3423673","DOIUrl":"https://doi.org/10.1145/3423327.3423673","url":null,"abstract":"Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthiness detection by means of more comprehensively integrating the audio-visual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based), and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild, which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic, in which participants recognise 10 domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust, in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CAR, the first of its kind in-the-wild database, which is utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34 * UAR + 0.66 * F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"281 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122942711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Representation Learning with Attention and Sequence to Sequence Autoencoders to Predict Sleepiness From Speech","authors":"S. Amiriparian, Pawel Winokurow, Vincent Karas, Sandra Ottl, Maurice Gerczuk, Björn Schuller","doi":"10.1145/3423327.3423670","DOIUrl":"https://doi.org/10.1145/3423327.3423670","url":null,"abstract":"Motivated by the attention mechanism of the human visual system and recent developments in the field of machine translation, we introduce our attention-based and recurrent sequence to sequence autoencoders for fully unsupervised representation learning from audio files. In particular, we test the efficacy of our novel approach on the task of speech-based sleepiness recognition. We evaluate the learnt representations from both autoencoders, and conduct an early fusion to ascertain possible complementarity between them. In our frameworks, we first extract Mel-spectrograms from raw audio. Second, we train recurrent autoencoders on these spectrograms which are considered as time-dependent frequency vectors. Afterwards, we extract the activations of specific fully connected layers of the autoencoders which represent the learnt features of spectrograms for the corresponding audio instances. Finally, we train support vector regressors on these representations to obtain the predictions. On the development partition of the data, we achieve Spearman's correlation coefficients of .324, .283, and .320 with the targets on the Karolinska Sleepiness Scale by utilising attention and non-attention autoencoders, and the fusion of both autoencoders' representations, respectively. In the same order, we achieve .311, .359, and .367 Spearman's correlation coefficients on the test data, indicating the suitability of our proposed fusion strategy.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"298 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115925521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End2You","authors":"Panagiotis Tzirakis","doi":"10.1145/3423327.3423513","DOIUrl":"https://doi.org/10.1145/3423327.3423513","url":null,"abstract":"Multimodal profiling is a fundamental component towards a complete interaction between human and machine. This is an important task for intelligent systems as they can automatically sense and adapt their responses according to the human behavior. The last 10 years, several advancements have been accomplished with the use of Deep Neural Networks (DNNs) in several areas including but not limited to affect recognition[1,2]. Convolution and recurrent neural networks are core components of DNNs that have been extensively used to extract robust spatial and temporal features, accordingly. To this end, we introduce End2You[3] an open-source toolkit implemented in Python and based on Tensorflow. It provides capabilities to train and evaluate models in an end-to-end manner, i.e., using raw input. It supports input from raw audio, visual, physiological or other types of information, and the output can be of an arbitrary representation, for either classification or regression tasks. Well known audio- and visual-model implementations are provided including ResNet[4], and MobileNet[5]. It can also capture the temporal dynamics in the signal, utilizing recurrent neural networks such as Long Short-Term Memory (LSTM). The toolkit also provides pretrained unimodal and multimodal models for the emotion recognition task using the RECOLA dataset[6]. To our knowledge, this is the first toolkit that provides generic end-to-end learning for profiling capabilities in either unimodal or multimodal cases. We depict results of the toolkit on the RECOLA dataset and show how it can be used on different datasets.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115399065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}