{"title":"Fusical:视频情感的多模态融合","authors":"Bo Jin, L. Abdelrahman, C. Chen, Amil Khanzada","doi":"10.1145/3382507.3417966","DOIUrl":null,"url":null,"abstract":"Determining the emotional sentiment of a video remains a challenging task that requires multimodal, contextual understanding of a situation. In this paper, we describe our entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge to classify group videos containing large variations in language, people, and environment, into one of three sentiment classes. Our end-to-end approach consists of independently training models for different modalities, including full-frame video scenes, human body keypoints, embeddings extracted from audio clips, and image-caption word embeddings. Novel combinations of modalities, such as laughter and image-captioning, and transfer learning are further developed. We use fully-connected (FC) fusion ensembling to aggregate the modalities, achieving a best test accuracy of 63.9% which is 16 percentage points higher than that of the baseline ensemble.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Fusical: Multimodal Fusion for Video Sentiment\",\"authors\":\"Bo Jin, L. Abdelrahman, C. Chen, Amil Khanzada\",\"doi\":\"10.1145/3382507.3417966\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Determining the emotional sentiment of a video remains a challenging task that requires multimodal, contextual understanding of a situation. In this paper, we describe our entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge to classify group videos containing large variations in language, people, and environment, into one of three sentiment classes. Our end-to-end approach consists of independently training models for different modalities, including full-frame video scenes, human body keypoints, embeddings extracted from audio clips, and image-caption word embeddings. Novel combinations of modalities, such as laughter and image-captioning, and transfer learning are further developed. 
We use fully-connected (FC) fusion ensembling to aggregate the modalities, achieving a best test accuracy of 63.9% which is 16 percentage points higher than that of the baseline ensemble.\",\"PeriodicalId\":402394,\"journal\":{\"name\":\"Proceedings of the 2020 International Conference on Multimodal Interaction\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2020 International Conference on Multimodal Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3382507.3417966\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3382507.3417966","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Determining the emotional sentiment of a video remains a challenging task that requires multimodal, contextual understanding of a situation. In this paper, we describe our entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge, which classifies group videos, varying widely in language, people, and environment, into one of three sentiment classes. Our end-to-end approach consists of independently training models for different modalities, including full-frame video scenes, human body keypoints, embeddings extracted from audio clips, and image-caption word embeddings. We further develop novel modality combinations, such as laughter detection and image captioning, along with transfer learning. We use fully-connected (FC) fusion ensembling to aggregate the modalities, achieving a best test accuracy of 63.9%, 16 percentage points higher than that of the baseline ensemble.
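The abstract does not include code, but the fusion step it describes is easy to picture. The following is a minimal PyTorch sketch of FC fusion ensembling over embeddings from independently trained modality models; the embedding dimensions, hidden size, and dropout rate are illustrative assumptions, not the authors' published architecture.

```python
import torch
import torch.nn as nn

class FCFusion(nn.Module):
    """Fully-connected fusion head over concatenated per-modality embeddings."""

    def __init__(self, modality_dims, num_classes=3, hidden_dim=128):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(sum(modality_dims), hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes),  # three sentiment classes
        )

    def forward(self, embeddings):
        # embeddings: list of (batch, dim_i) tensors, one per modality,
        # e.g. scene, body-keypoint, audio, and image-caption features
        # produced by frozen, independently trained backbones.
        fused = torch.cat(embeddings, dim=-1)
        return self.classifier(fused)

# Hypothetical embedding sizes for four modality backbones.
dims = [512, 256, 128, 300]
model = FCFusion(dims)
batch = [torch.randn(8, d) for d in dims]
logits = model(batch)  # (8, 3) class scores
```

In this kind of late-fusion setup, each modality model is trained on its own and only the lightweight FC head learns how to weight their combined evidence, which matches the paper's description of aggregating independently trained models.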