Zhao Guo, Lianli Gao, Jingkuan Song, Xing Xu, Jie Shao, Heng Tao Shen
{"title":"Attention-based LSTM with Semantic Consistency for Videos Captioning","authors":"Zhao Guo, Lianli Gao, Jingkuan Song, Xing Xu, Jie Shao, Heng Tao Shen","doi":"10.1145/2964284.2967242","DOIUrl":null,"url":null,"abstract":"Recent progress in using Long Short-Term Memory (LSTM) for image description has motivated the exploration of their applications for automatically describing video content with natural language sentences. By taking a video as a sequence of features, LSTM model is trained on video-sentence pairs to learn association of a video to a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without considering attention which allows for salient features. Furthermore, most existing approaches model the translating error, but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to transfer videos to natural sentences. This framework integrates attention mechanism with LSTM to capture salient structures of video, and explores the correlation between multi-modal representations for generating sentences with rich semantic content. More specifically, we first propose an attention mechanism which uses the dynamic weighted sum of local 2D Convolutional Neural Network (CNN) and 3D CNN representations. Then, a LSTM decoder takes these visual features at time $t$ and the word-embedding feature at time $t$-$1$ to generate important words. Finally, we uses multi-modal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistence of the sentence description and the video visual content. Experiments on the benchmark datasets demonstrate the superiority of our method than the state-of-the-art baselines for video captioning in both BLEU and METEOR.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"88 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"56","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 24th ACM international conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2964284.2967242","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 56
Abstract
Recent progress in using Long Short-Term Memory (LSTM) networks for image description has motivated the exploration of their application to automatically describing video content with natural language sentences. By treating a video as a sequence of features, an LSTM model is trained on video-sentence pairs to learn the association between a video and a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without an attention mechanism that allows for salient features. Furthermore, most existing approaches model the translation error but ignore the correlation between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, for translating videos into natural sentences. This framework integrates an attention mechanism with LSTM to capture the salient structures of a video, and explores the correlation between multi-modal representations to generate sentences with rich semantic content. More specifically, we first propose an attention mechanism that uses a dynamic weighted sum of local 2D Convolutional Neural Network (CNN) and 3D CNN representations. Then, an LSTM decoder takes these visual features at time $t$ and the word-embedding feature at time $t-1$ to generate important words. Finally, we use a multi-modal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistency between the sentence description and the video's visual content. Experiments on benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines for video captioning in terms of both BLEU and METEOR.
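To make the decoding step described above concrete, the following is a minimal sketch (in PyTorch) of a soft-attention LSTM decoder: a dynamic weighted sum over per-frame 2D-CNN and 3D-CNN features is computed at each time step $t$ and fed to an LSTM cell together with the embedding of the word generated at $t-1$. All class names, layer choices, and dimensions (e.g. `feat_dim`, `hidden_dim`) are illustrative assumptions, not the authors' exact implementation; the multi-modal joint-embedding consistency term mentioned in the abstract would be an additional loss on top of the usual word-prediction cross-entropy.

```python
# Hypothetical sketch of an attention-based LSTM decoding step for video captioning.
# Assumptions: frame features are pre-extracted 2D-CNN + 3D-CNN vectors concatenated
# per frame; dimensions and module names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTMDecoder(nn.Module):
    def __init__(self, feat_dim=1024 + 4096, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # attention score depends on each frame feature and the current decoder state
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, feats, h):
        # feats: (batch, num_frames, feat_dim); h: (batch, hidden_dim)
        e = self.att_score(torch.tanh(self.att_feat(feats) + self.att_hidden(h).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)           # attention weights over frames
        return (alpha * feats).sum(dim=1)     # dynamic weighted sum of frame features

    def step(self, feats, prev_word, h, c):
        # prev_word: indices of the words generated at time t-1
        context = self.attend(feats, h)       # attended visual feature at time t
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.out(h), h, c              # logits over the vocabulary, new LSTM state
```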