Fusical: Multimodal Fusion for Video Sentiment

Proceedings of the 2020 International Conference on Multimodal Interaction
Pub Date: 2020-10-21
DOI: 10.1145/3382507.3417966

Bo Jin, L. Abdelrahman, C. Chen, Amil Khanzada

Citations: 3

Abstract

Determining the emotional sentiment of a video remains a challenging task that requires multimodal, contextual understanding of a situation. In this paper, we describe our entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge, in which group videos with large variations in language, people, and environment must be classified into one of three sentiment classes. Our end-to-end approach independently trains models for different modalities, including full-frame video scenes, human body keypoints, embeddings extracted from audio clips, and image-caption word embeddings. We further develop novel combinations of modalities, such as laughter and image captioning, along with transfer learning. We use fully-connected (FC) fusion ensembling to aggregate the modalities, achieving a best test accuracy of 63.9%, which is 16 percentage points higher than that of the baseline ensemble.
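As a rough illustration of what fully-connected fusion over independently trained modality models can look like, the minimal sketch below concatenates per-modality class probabilities and passes them through a single FC layer followed by a softmax. It is not the authors' implementation: the modality names, the use of class probabilities as fusion inputs, the single-layer design, and the random (untrained) weights are all assumptions made for illustration. Only the three-class output comes from the abstract.

```python
import math
import random

NUM_CLASSES = 3  # three sentiment classes, as stated in the abstract
# Hypothetical modality names; one independently trained model per modality is assumed.
MODALITIES = ["scene", "pose", "audio", "caption"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fc_fusion(modality_probs, weights, bias):
    """Concatenate per-modality outputs and apply one fully connected layer.

    weights holds one row per output class; each row has one entry per
    concatenated input feature (len(MODALITIES) * NUM_CLASSES entries).
    """
    features = [p for name in MODALITIES for p in modality_probs[name]]
    logits = [
        sum(f * w for f, w in zip(features, row)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


rng = random.Random(0)

# Stand-in outputs for one video: each modality model emits class probabilities.
modality_probs = {
    name: softmax([rng.gauss(0.0, 1.0) for _ in range(NUM_CLASSES)])
    for name in MODALITIES
}

# Untrained placeholder parameters; in practice these would be learned
# on held-out predictions from the modality models.
num_features = len(MODALITIES) * NUM_CLASSES
weights = [[rng.gauss(0.0, 0.1) for _ in range(num_features)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

fused = fc_fusion(modality_probs, weights, bias)
predicted_class = max(range(NUM_CLASSES), key=fused.__getitem__)
```

Compared with averaging or majority voting, a learned fusion layer of this kind can weight each modality differently per class, which is one plausible reason to prefer it when modalities differ in reliability.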


