FERNIE-ViL: Facial Expression Enhanced Vision-and-Language Model
Soo-Ryeon Lee, Dohyun Kim, Mingyu Lee, SangKeun Lee
2021 IEEE 20th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), 2021-10-29. DOI: 10.1109/ICCICC53683.2021.9811331
Abstract
Visual cognition requires analyzing the actions, intentions, and emotions of the people in a given image. Visual Commonsense Reasoning (VCR) is a task in which a model selects an answer to a question about a given image and a rationale that justifies that answer. In VCR, facial expressions are important nonverbal signals because they convey emotions and intentions in human interactions. However, ERNIE-ViL and UNITER, vision-and-language models that learn joint image and text representations, do not learn facial expressions. We find that ERNIE-ViL and UNITER consequently struggle to identify emotions. In this paper, we therefore propose FERNIE-ViL, which adds a facial expression recognition module to an existing vision-and-language model. Experimental results (improvements of 2.4 percentage points on VCR Q→A and 0.3 percentage points on VCR QA→R) demonstrate that our method can enhance visual commonsense reasoning through a better understanding of human interactions.
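The abstract states only that a facial expression recognition (FER) module is adapted to an existing vision-and-language model, without specifying the fusion mechanism. The sketch below illustrates one plausible way such a module could enrich per-region visual features with emotion signals before they enter a vision-and-language encoder such as ERNIE-ViL. All class names, dimensions, and the fusion scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FacialExpressionFusion(nn.Module):
    """Hypothetical sketch: enrich per-region visual features with
    emotion embeddings from a FER head before they are fed into a
    vision-and-language encoder. Names and dimensions are assumptions."""

    def __init__(self, region_dim=2048, num_emotions=7, hidden_dim=768):
        super().__init__()
        # FER head: classifies each detected face region into basic
        # emotion categories (7 classes are common in FER datasets).
        self.fer_head = nn.Linear(region_dim, num_emotions)
        # Learned embedding vector for each emotion class.
        self.emotion_embed = nn.Embedding(num_emotions, hidden_dim)
        # Project region features, then fuse them with emotion embeddings.
        self.visual_proj = nn.Linear(region_dim, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, region_feats, face_mask):
        # region_feats: (batch, num_regions, region_dim) detector features
        # face_mask:    (batch, num_regions), 1.0 where the region is a face
        emotion_logits = self.fer_head(region_feats)
        emotion_ids = emotion_logits.argmax(dim=-1)     # predicted emotion per region
        emotion_vecs = self.emotion_embed(emotion_ids)  # (batch, regions, hidden)
        # Only face regions carry an emotion signal; others get zeros.
        emotion_vecs = emotion_vecs * face_mask.unsqueeze(-1)
        visual = self.visual_proj(region_feats)
        fused = self.fuse(torch.cat([visual, emotion_vecs], dim=-1))
        return fused  # passed on to the vision-and-language encoder


# Toy usage: 2 images, 4 regions each; the first region of each is a face.
feats = torch.randn(2, 4, 2048)
mask = torch.tensor([[1., 0., 0., 0.], [1., 0., 0., 0.]])
enriched = FacialExpressionFusion()(feats, mask)
print(enriched.shape)  # torch.Size([2, 4, 768])
```

In this sketch the fused features keep the encoder's expected hidden size, so they could replace the plain region features without altering the downstream VCR heads; whether FERNIE-ViL fuses features this way or conditions the model on emotion labels differently is not specified in the abstract.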