Spatial Relational Attention Using Fully Convolutional Networks for Image Caption Generation

Int. J. Comput. Intell. Appl. Pub Date: 2020-06-01 DOI: 10.1142/s146902682050011x

Teng Jiang, Liang Gong, Yupu Yang

Citations: 1

Abstract

The attention-based encoder–decoder framework has greatly improved image caption generation. The attention mechanism plays a transitional role by transforming static image features into sequential captions. To generate reasonable captions, it is important to detect the spatial characteristics of images. In this paper, we propose a spatial relational attention approach that considers spatial positions and attributes. Image features are first weighted by the attention mechanism and then concatenated with contextual features to form a spatial–visual tensor. A fully convolutional network extracts features from this tensor to produce visual concepts for the decoder network; the fully convolutional layers maintain the spatial topology of the image. Experiments on three benchmark datasets, Flickr8k, Flickr30k and MSCOCO, demonstrate the effectiveness of the proposed approach. Captions generated by the spatial relational attention method precisely capture the spatial relations of objects.
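The pipeline described in the abstract (attention weighting, concatenation with contextual features, then a fully convolutional network) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The abstract does not give the attention scoring function, the layer sizes, or how the output enters the decoder, so the dot-product scoring, the 3x3 and 1x1 layers, the random weights, and every function name below are assumptions chosen only to show the data flow.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, query):
    """Weight each grid cell's feature vector by its attention weight.

    features: H x W x D nested lists. query: length-D list standing in for
    the decoder state (assumed dot-product scoring, not from the paper).
    The weighted grid is NOT summed, so the spatial layout is preserved.
    """
    h, w = len(features), len(features[0])
    scores = [sum(f * q for f, q in zip(features[i][j], query))
              for i in range(h) for j in range(w)]
    alpha = softmax(scores)
    weights = [alpha[i * w:(i + 1) * w] for i in range(h)]
    weighted = [[[weights[i][j] * f for f in features[i][j]]
                 for j in range(w)] for i in range(h)]
    return weighted, weights

def concat_context(weighted, context):
    """Tile the context vector over the grid and concatenate channel-wise."""
    return [[cell + list(context) for cell in row] for row in weighted]

def conv2d_relu(x, kernels, k):
    """k x k convolution with zero 'same' padding, followed by ReLU.

    x: H x W x C_in. kernels: C_out x k x k x C_in. Returns H x W x C_out.
    """
    h, w, c_in = len(x), len(x[0]), len(x[0][0])
    pad = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            cell = []
            for kern in kernels:
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += sum(x[ii][jj][c] * kern[di][dj][c]
                                       for c in range(c_in))
                cell.append(max(acc, 0.0))
            row.append(cell)
        out.append(row)
    return out

def random_kernels(rng, c_out, k, c_in):
    """Untrained placeholder weights, for shape illustration only."""
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(c_in)]
              for _ in range(k)] for _ in range(k)] for _ in range(c_out)]

def spatial_relational_attention(features, query, context, kernels3, kernels1):
    """Attention weighting -> concat with context -> small conv stack."""
    weighted, weights = attend(features, query)
    tensor = concat_context(weighted, context)    # spatial-visual tensor
    hidden = conv2d_relu(tensor, kernels3, 3)     # mixes neighbouring cells
    concepts = conv2d_relu(hidden, kernels1, 1)   # per-cell projection
    return concepts, weights

rng = random.Random(0)
H, W, D, C = 4, 4, 6, 3  # grid size, feature dim, context dim (all assumed)
features = [[[rng.random() for _ in range(D)] for _ in range(W)]
            for _ in range(H)]
query = [rng.random() for _ in range(D)]
context = [rng.random() for _ in range(C)]
concepts, weights = spatial_relational_attention(
    features, query, context,
    random_kernels(rng, 8, 3, D + C), random_kernels(rng, 5, 1, 8))
print(len(concepts), len(concepts[0]), len(concepts[0][0]))  # prints: 4 4 5
```

Because every layer is convolutional, the output keeps the same H x W grid as the input features, which is one way to read the abstract's statement that the fully convolutional layers maintain the spatial topology of the image.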
