G3raphGround:基于图形的语言基础

2019 IEEE/CVF International Conference on Computer Vision (ICCV) Pub Date : 2019-10-01 DOI:10.1109/ICCV.2019.00438

Mohit Bajaj, Lanjun Wang, L. Sigal

{"title":"G3raphGround:基于图形的语言基础","authors":"Mohit Bajaj, Lanjun Wang, L. Sigal","doi":"10.1109/ICCV.2019.00438","DOIUrl":null,"url":null,"abstract":"In this paper we present an end-to-end framework for grounding of phrases in images. In contrast to previous works, our model, which we call GraphGround, uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases. We capture intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then use conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships. This final representation results in grounding decisions. The framework supports many-to-many matching and is able to ground single phrase to multiple image regions and vice versa. We validate our design choices through a series of ablation studies and illustrate state-of-the-art performance on Flickr30k and ReferIt Game benchmark datasets.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"14 1","pages":"4280-4289"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":"{\"title\":\"G3raphGround: Graph-Based Language Grounding\",\"authors\":\"Mohit Bajaj, Lanjun Wang, L. Sigal\",\"doi\":\"10.1109/ICCV.2019.00438\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper we present an end-to-end framework for grounding of phrases in images. In contrast to previous works, our model, which we call GraphGround, uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases. We capture intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then use conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships. This final representation results in grounding decisions. The framework supports many-to-many matching and is able to ground single phrase to multiple image regions and vice versa. We validate our design choices through a series of ablation studies and illustrate state-of-the-art performance on Flickr30k and ReferIt Game benchmark datasets.\",\"PeriodicalId\":6728,\"journal\":{\"name\":\"2019 IEEE/CVF International Conference on Computer Vision (ICCV)\",\"volume\":\"14 1\",\"pages\":\"4280-4289\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"49\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE/CVF International Conference on Computer Vision (ICCV)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCV.2019.00438\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCV.2019.00438","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 49

摘要

在本文中，我们提出了一个端到端的框架，用于图像中短语的基础。与之前的作品相比，我们的模型(我们称之为GraphGround)使用图形来制定提案图像区域和短语之间更复杂、非顺序的依赖关系。我们为每个模态(视觉和语言)使用单独的图神经网络捕获模态内依赖关系，然后在另一个图神经网络中使用条件消息传递来融合它们的输出并捕获跨模态关系。这种最终的表示导致接地决策。该框架支持多对多匹配，并能够将单个短语接地到多个图像区域，反之亦然。我们通过一系列的研究来验证我们的设计选择，并在Flickr30k和ReferIt Game基准数据集上展示了最先进的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

G3raphGround: Graph-Based Language Grounding

In this paper we present an end-to-end framework for grounding of phrases in images. In contrast to previous works, our model, which we call GraphGround, uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases. We capture intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then use conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships. This final representation results in grounding decisions. The framework supports many-to-many matching and is able to ground single phrase to multiple image regions and vice versa. We validate our design choices through a series of ablation studies and illustrate state-of-the-art performance on Flickr30k and ReferIt Game benchmark datasets.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

自引率

0.00%

发文量