Image captioning with deep LSTM based on sequential residual

2017 IEEE International Conference on Multimedia and Expo (ICME). Pub Date: 2017-07-10. DOI: 10.1109/ICME.2017.8019408

Kaisheng Xu, Hanli Wang, Pengjie Tang

Citations: 22

Abstract

Image captioning is a fundamental task that requires semantic understanding of images and the ability to generate well-formed descriptive sentences. Because the language models in current image captioning frameworks are typically shallow, this work proposes a deep residual recurrent neural network with two contributions. First, an easy-to-train deep stacked Long Short-Term Memory (LSTM) language model is designed to learn the residual function of output distributions by adding identity mappings to multi-layer LSTMs. Second, to counter the over-fitting caused by the larger number of parameters in deeper LSTM networks, a novel temporal Dropout method is introduced into the LSTM. Experimental results on the benchmark MSCOCO and Flickr30K datasets show that the proposed model achieves state-of-the-art performance, with 101.1 in CIDEr on MSCOCO and 22.9 in B-4 on Flickr30K.
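The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea of adding identity mappings between stacked LSTM layers. Everything in it is an assumption made for illustration: the names `LSTMCell`, `dropout`, and `residual_stack` are hypothetical, biases are omitted, the skip connection is a plain element-wise sum, and the dropout shown simply draws a fresh mask at every time step, which may differ from the temporal Dropout the authors propose.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class LSTMCell:
    """Plain LSTM cell (no biases); input and hidden size are both `size`."""

    def __init__(self, size, rng):
        self.size = size
        # One weight matrix per gate (input, forget, output, candidate),
        # each acting on the concatenation [x; h].
        self.W = [[[rng.uniform(-0.1, 0.1) for _ in range(2 * size)]
                   for _ in range(size)] for _ in range(4)]

    def step(self, x, h, c):
        xh = x + h  # list concatenation
        i, f, o, g = (matvec(W, xh) for W in self.W)
        c_new = [sigmoid(fj) * cj + sigmoid(ij) * math.tanh(gj)
                 for ij, fj, gj, cj in zip(i, f, g, c)]
        h_new = [sigmoid(oj) * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def dropout(v, p, rng):
    """Inverted dropout; a new mask is sampled on every call (assumption)."""
    if p <= 0.0:
        return list(v)
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in v]


def residual_stack(xs, cells, p_drop=0.0, rng=None):
    """Run a sequence through stacked LSTM layers with identity skips."""
    rng = rng or random.Random(0)
    size = cells[0].size
    states = [([0.0] * size, [0.0] * size) for _ in cells]
    outputs = []
    for x in xs:  # iterate over time steps
        inp = x
        for k, cell in enumerate(cells):
            h, c = cell.step(inp, *states[k])
            states[k] = (h, c)
            # Identity mapping: the layer only has to model a residual
            # that is added to its own input.
            inp = [a + b for a, b in zip(inp, dropout(h, p_drop, rng))]
        outputs.append(inp)
    return outputs
```

With the skip connection in place, a layer that contributes nothing leaves its input unchanged, which is the usual argument for why residual stacks are easier to train as depth grows.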



