{"title":"G3raphGround: Graph-Based Language Grounding","authors":"Mohit Bajaj, Lanjun Wang, L. Sigal","doi":"10.1109/ICCV.2019.00438","DOIUrl":null,"url":null,"abstract":"In this paper we present an end-to-end framework for grounding of phrases in images. In contrast to previous works, our model, which we call GraphGround, uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases. We capture intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then use conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships. This final representation results in grounding decisions. The framework supports many-to-many matching and is able to ground single phrase to multiple image regions and vice versa. We validate our design choices through a series of ablation studies and illustrate state-of-the-art performance on Flickr30k and ReferIt Game benchmark datasets.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"14 1","pages":"4280-4289"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCV.2019.00438","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 49
Abstract
In this paper, we present an end-to-end framework for grounding phrases in images. In contrast to previous works, our model, which we call GraphGround, uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases. We capture intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then use conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships. Grounding decisions are made from this final fused representation. The framework supports many-to-many matching and is able to ground a single phrase to multiple image regions, and vice versa. We validate our design choices through a series of ablation studies and illustrate state-of-the-art performance on the Flickr30k and ReferIt Game benchmark datasets.
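To make the described architecture concrete, below is a minimal PyTorch sketch based only on the abstract: one message-passing GNN per modality, followed by a fusion GNN whose phrase nodes are conditioned on a pooled visual context, and a per-pair score that permits many-to-many matches. This is not the authors' implementation; all names (`MessagePassingLayer`, `GroundingSketch`, `dim`, `rounds`) are hypothetical, and pooled-context conditioning is just one plausible reading of "conditional message-passing".

```python
import torch
import torch.nn as nn


class MessagePassingLayer(nn.Module):
    """One round of message passing on a fully connected graph.

    Each node aggregates attention-weighted messages from all nodes
    and updates its state with a GRU cell.
    """

    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)
        self.attn = nn.Linear(2 * dim, 1)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, nodes):
        # nodes: (N, dim)
        n = nodes.size(0)
        recv = nodes.unsqueeze(1).expand(n, n, -1)        # receiver i at [i, j]
        send = nodes.unsqueeze(0).expand(n, n, -1)        # sender j at [i, j]
        pair = torch.cat([recv, send], dim=-1)            # (N, N, 2*dim)
        msgs = torch.tanh(self.message(pair))             # (N, N, dim)
        weights = torch.softmax(self.attn(pair).squeeze(-1), dim=1)  # (N, N)
        agg = (weights.unsqueeze(-1) * msgs).sum(dim=1)   # (N, dim)
        return self.update(agg, nodes)


class GroundingSketch(nn.Module):
    """Three GNNs: one per modality, plus a fusion GNN over phrase nodes
    conditioned on visual context (a guess at 'conditional message-passing')."""

    def __init__(self, dim=256, rounds=2):
        super().__init__()
        self.visual_gnn = nn.ModuleList([MessagePassingLayer(dim) for _ in range(rounds)])
        self.lingual_gnn = nn.ModuleList([MessagePassingLayer(dim) for _ in range(rounds)])
        self.fuse = nn.Linear(2 * dim, dim)
        self.fusion_gnn = nn.ModuleList([MessagePassingLayer(dim) for _ in range(rounds)])
        self.score = nn.Bilinear(dim, dim, 1)

    def forward(self, regions, phrases):
        # regions: (R, dim) proposal features; phrases: (P, dim) phrase encodings
        for layer in self.visual_gnn:                     # intra-modal, visual
            regions = layer(regions)
        for layer in self.lingual_gnn:                    # intra-modal, lingual
            phrases = layer(phrases)
        # Condition each phrase node on a pooled visual summary before fusion.
        ctx = regions.mean(dim=0, keepdim=True).expand_as(phrases)
        fused = torch.tanh(self.fuse(torch.cat([phrases, ctx], dim=-1)))
        for layer in self.fusion_gnn:                     # cross-modal fusion
            fused = layer(fused)
        # Independent sigmoid per (phrase, region) pair allows a phrase to
        # match several regions and a region to match several phrases.
        p, r, d = fused.size(0), regions.size(0), fused.size(-1)
        logits = self.score(
            fused.unsqueeze(1).expand(p, r, d).reshape(-1, d),
            regions.unsqueeze(0).expand(p, r, d).reshape(-1, d),
        ).view(p, r)
        return torch.sigmoid(logits)                      # (P, R) probabilities


# Usage: 12 region proposals, 3 phrases, randomly initialized features.
model = GroundingSketch(dim=256)
probs = model(torch.randn(12, 256), torch.randn(3, 256))  # shape (3, 12)
```

Thresholding each entry of the (P, R) probability matrix independently, rather than taking a softmax over regions, is what makes the many-to-many grounding described in the abstract possible in this sketch.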