Explainable Video Entailment with Grounded Visual Evidence

2021 IEEE/CVF International Conference on Computer Vision (ICCV). Pub Date: 2021-10-01. DOI: 10.1109/ICCV48922.2021.00203

Junwen Chen, Yu Kong

Citations: 7

Abstract

Video entailment aims to determine whether a textual hypothesis is entailed or contradicted by a premise video. The main challenge is that it requires fine-grained reasoning over long, complex, story-based videos. To this end, we propose incorporating visual grounding into entailment by explicitly linking the entities described in the statement to evidence in the video. When the entities are grounded in the video, we enhance the entailment judgment by focusing on the frames in which they occur. In addition, the entailed/contradictory (also called real/fake) statements in the entailment dataset come in pairs that differ only subtly, which allows an add-on explanation module to predict which words or phrases make a statement contradict the video and to regularize training of the entailment judgment. Experimental results demonstrate that our approach outperforms state-of-the-art methods.
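As a rough illustration of "focusing on the frames where the entities occur", the sketch below weights frames by how well they match any entity embedding and pools frame features with those weights. This is not the authors' implementation: the function names (`grounding_weights`, `pool_frames`), the cosine-similarity scoring, the softmax temperature, and the toy 2-D features are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def grounding_weights(frame_feats, entity_feats, temperature=0.1):
    """Hypothetical grounding step: score each frame by its best-matching
    entity, then normalize the scores over frames with a softmax."""
    scores = [max(cosine(f, e) for e in entity_feats) for f in frame_feats]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def pool_frames(frame_feats, weights):
    """Weighted average of frame features, emphasizing grounded frames."""
    dim = len(frame_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_feats)) for d in range(dim)]


# Toy example (assumed data): three frames, one entity embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
entities = [[0.0, 1.0]]
weights = grounding_weights(frames, entities)
pooled = pool_frames(frames, weights)
```

In this toy setup the second frame matches the entity best and so dominates the pooled video feature; an entailment classifier would then consume that pooled feature together with the statement representation.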

