HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering

2021 IEEE/CVF International Conference on Computer Vision (ICCV) Pub Date : 2021-10-01 DOI:10.1109/ICCV48922.2021.00172

Fei Liu, Jing Liu, Weining Wang, Hanqing Lu

Pages: 1678-1687

Citations: 29

Abstract

Relational reasoning is at the heart of video question answering. However, existing approaches suffer from several common limitations: (1) they focus on either object-level or frame-level relational reasoning alone, and fail to integrate the two; and (2) they neglect to leverage semantic knowledge for relational reasoning. In this work, we propose a Hierarchical VisuAl-Semantic RelatIonal Reasoning (HAIR) framework to address these limitations. Specifically, we present a novel graph memory mechanism to perform relational reasoning, and further develop two types of graph memory: a) visual graph memory, which leverages the visual information of the video for relational reasoning; and b) semantic graph memory, which is designed to explicitly leverage the semantic knowledge contained in the classes and attributes of video objects and to perform relational reasoning in the semantic space. Taking advantage of both graph memory mechanisms, we build a hierarchical framework to enable visual-semantic relational reasoning from the object level to the frame level. Experiments on four challenging benchmark datasets show that the proposed framework achieves state-of-the-art performance with fewer parameters and faster inference. Our approach also shows superior performance on another video+language task.
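The object-to-frame hierarchy described in the abstract can be illustrated with a small sketch. This is not the authors' graph memory mechanism, whose details the abstract does not give; it swaps in a generic attention-style message-passing step on a fully connected graph, applied first to the objects within each frame and then to the resulting frame vectors. All function names, the 0.5 residual mix, the mean pooling, and the toy feature values are assumptions made for illustration. The same routine is run once on hypothetical visual features and once on hypothetical semantic features (for example, embeddings of object class and attribute labels), and the two results are concatenated.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relational_step(nodes):
    """One round of attention-weighted message passing (assumed, generic).

    Every node attends to every node (itself included) using dot-product
    similarity, then mixes the aggregated message with its own feature.
    """
    updated = []
    for query in nodes:
        weights = softmax([dot(query, key) for key in nodes])
        message = [
            sum(w * node[d] for w, node in zip(weights, nodes))
            for d in range(len(query))
        ]
        updated.append([0.5 * (q + m) for q, m in zip(query, message)])
    return updated


def mean_pool(nodes):
    """Average a list of equally sized vectors into one vector."""
    count = len(nodes)
    return [sum(node[d] for node in nodes) / count for d in range(len(nodes[0]))]


def hierarchical_reasoning(frames):
    """Reason over objects within each frame, then over the frames."""
    frame_nodes = [mean_pool(relational_step(objects)) for objects in frames]
    return mean_pool(relational_step(frame_nodes))


# Toy input: 2 frames, 2 objects per frame, 2-dimensional features.
visual = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
semantic = [[[0.2, 0.1], [0.4, 0.3]], [[0.1, 0.1], [0.3, 0.2]]]

# Concatenate the visual and semantic video-level vectors.
video_repr = hierarchical_reasoning(visual) + hierarchical_reasoning(semantic)
print(len(video_repr))  # 4
```

In a question answering model, a vector like `video_repr` would typically be combined with a question encoding before answer prediction; that step is omitted here because the abstract does not describe it.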

