Geometry-aware Relational Exemplar Attention for Dense Captioning

T. Wang, H. R. Tavakoli, Mats Sjöberg, Jorma T. Laaksonen
{"title":"Geometry-aware Relational Exemplar Attention for Dense Captioning","authors":"T. Wang, H. R. Tavakoli, Mats Sjöberg, Jorma T. Laaksonen","doi":"10.1145/3347450.3357656","DOIUrl":null,"url":null,"abstract":"Dense captioning (DC), which provides a comprehensive context understanding of images by describing all salient visual groundings in an image, facilitates multimodal understanding and learning. As an extension of image captioning, DC is developed to discover richer sets of visual contents and to generate captions of wider diversity and increased details. The state-of-the-art models of DC consist of three stages: (1) region proposals, (2) region classification, and (3) caption generation for each proposal. They are typically built upon the following ideas: (a) guiding the caption generation with image-level features as the context cues along with regional features and (b) refining locations of region proposals with caption information. In this work, we propose (a) a joint visual-textual criterion exploited by the region classifier that further improves both region detection and caption accuracy, and (b) a Geometry aware Relational Exemplar attention (GREatt) mechanism to relate region proposals. The former helps the model learn a region classifier by effectively exploiting both visual groundings and caption descriptions. Rather than treating each region proposal in isolation, the latter relates regions in complementary relations, i.e. contextually dependent, visually supported and geometry relations, to enrich context information in regional representations. We conduct an extensive set of experiments and demonstrate that our proposed model improves the state-of-the-art by at least +5.3% in terms of the mean average precision on the Visual Genome dataset.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3347450.3357656","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Dense captioning (DC) facilitates multimodal understanding and learning by providing a comprehensive contextual understanding of an image: it describes all of the image's salient visual groundings. As an extension of image captioning, DC aims to discover richer sets of visual content and to generate captions of wider diversity and greater detail. State-of-the-art DC models consist of three stages: (1) region proposal, (2) region classification, and (3) caption generation for each proposal. They are typically built on two ideas: (a) guiding caption generation with image-level features as context cues alongside regional features, and (b) refining the locations of region proposals with caption information. In this work, we propose (a) a joint visual-textual criterion, exploited by the region classifier, that further improves both region detection and caption accuracy, and (b) a Geometry-aware Relational Exemplar attention (GREatt) mechanism to relate region proposals. The former helps the model learn a region classifier by effectively exploiting both visual groundings and caption descriptions. Rather than treating each region proposal in isolation, the latter relates regions through complementary relations, i.e., contextually dependent, visually supported, and geometric relations, to enrich the context information in regional representations. An extensive set of experiments demonstrates that the proposed model improves the state of the art by at least +5.3% in mean average precision on the Visual Genome dataset.
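
The GREatt mechanism described above attends over region proposals using both appearance and pairwise box geometry. The abstract does not spell out its formulation, so the following Python sketch is only illustrative: it uses the log-scale relative-geometry encoding common in relation-network-style attention, and all function names, weight shapes, and the residual update are assumptions rather than the authors' implementation.

import torch
import torch.nn.functional as F

def box_geometry(boxes):
    """Pairwise relative geometry between boxes given as (x, y, w, h).

    Returns an (N, N, 4) tensor of log-scale offsets, a common
    translation- and scale-invariant encoding for relation attention.
    """
    x, y, w, h = boxes.unbind(-1)                        # each (N,)
    dx = torch.log(torch.abs(x[:, None] - x[None, :]) / w[:, None] + 1e-3)
    dy = torch.log(torch.abs(y[:, None] - y[None, :]) / h[:, None] + 1e-3)
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)         # (N, N, 4)

def geometry_aware_attention(feats, boxes, w_q, w_k, w_v, w_g):
    """Relate N region features (N, D) via appearance + geometry attention."""
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v      # each (N, D)
    app_logits = q @ k.t() / k.shape[-1] ** 0.5          # appearance affinity
    geo = F.relu(box_geometry(boxes) @ w_g).squeeze(-1)  # (N, N) geometry gate
    # Geometry acts as a log-domain bias, suppressing implausible pairs.
    logits = app_logits + torch.log(geo.clamp(min=1e-6))
    attn = logits.softmax(dim=-1)
    return feats + attn @ v                              # enriched regional features

Here feats would be the (N, D) regional features from the proposal stage, boxes their (x, y, w, h) coordinates, and w_g a learned (4, 1) projection of the geometry encoding; clamping the geometry gate before the log keeps pairs with no geometric support from dominating the softmax.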