FERNIE-ViL: Facial Expression Enhanced Vision-and-Language Model
Soo-Ryeon Lee, Dohyun Kim, Mingyu Lee, SangKeun Lee
2021 IEEE 20th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), 2021-10-29. DOI: 10.1109/ICCICC53683.2021.9811331
Abstract
Visual cognition requires analyzing the actions, intentions, and emotions of the people in a given image. Visual Commonsense Reasoning (VCR) is a task in which a model selects an answer to a question about a given image and a rationale that justifies that answer. In VCR, facial expressions are important nonverbal signals because they convey emotions and intentions in human interactions. However, ERNIE-ViL and UNITER, vision-and-language models that learn joint image and text representations, do not learn facial expressions. We find that ERNIE-ViL and UNITER consequently struggle to identify emotions. In this paper, we therefore propose FERNIE-ViL, which adds a facial expression recognition module to an existing vision-and-language model. Experimental results (improvements of 2.4 percentage points on VCR Q→A and 0.3 percentage points on VCR QA→R) demonstrate that our method can enhance visual commonsense reasoning through a better understanding of human interactions.
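The abstract states only that a facial expression recognition (FER) module is adapted to an existing vision-and-language model, without specifying the fusion mechanism. The sketch below illustrates one plausible way such a module could enrich per-region visual features with emotion signals before they enter a vision-and-language encoder such as ERNIE-ViL. All class names, dimensions, and the fusion scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FacialExpressionFusion(nn.Module):
    """Hypothetical sketch: enrich per-region visual features with
    emotion embeddings from a FER head before they are fed into a
    vision-and-language encoder. Names and dimensions are assumptions."""

    def __init__(self, region_dim=2048, num_emotions=7, hidden_dim=768):
        super().__init__()
        # FER head: classifies each detected face region into basic
        # emotion categories (7 classes are common in FER datasets).
        self.fer_head = nn.Linear(region_dim, num_emotions)
        # Learned embedding vector for each emotion class.
        self.emotion_embed = nn.Embedding(num_emotions, hidden_dim)
        # Project region features, then fuse them with emotion embeddings.
        self.visual_proj = nn.Linear(region_dim, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, region_feats, face_mask):
        # region_feats: (batch, num_regions, region_dim) detector features
        # face_mask:    (batch, num_regions), 1.0 where the region is a face
        emotion_logits = self.fer_head(region_feats)
        emotion_ids = emotion_logits.argmax(dim=-1)     # predicted emotion per region
        emotion_vecs = self.emotion_embed(emotion_ids)  # (batch, regions, hidden)
        # Only face regions carry an emotion signal; others get zeros.
        emotion_vecs = emotion_vecs * face_mask.unsqueeze(-1)
        visual = self.visual_proj(region_feats)
        fused = self.fuse(torch.cat([visual, emotion_vecs], dim=-1))
        return fused  # passed on to the vision-and-language encoder


# Toy usage: 2 images, 4 regions each; the first region of each is a face.
feats = torch.randn(2, 4, 2048)
mask = torch.tensor([[1., 0., 0., 0.], [1., 0., 0., 0.]])
enriched = FacialExpressionFusion()(feats, mask)
print(enriched.shape)  # torch.Size([2, 4, 768])
```

In this sketch the fused features keep the encoder's expected hidden size, so they could replace the plain region features without altering the downstream VCR heads; whether FERNIE-ViL fuses features this way or conditions the model on emotion labels differently is not specified in the abstract.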