Zhao Guo, Lianli Gao, Jingkuan Song, Xing Xu, Jie Shao, Heng Tao Shen
{"title":"Attention-based LSTM with Semantic Consistency for Videos Captioning","authors":"Zhao Guo, Lianli Gao, Jingkuan Song, Xing Xu, Jie Shao, Heng Tao Shen","doi":"10.1145/2964284.2967242","DOIUrl":null,"url":null,"abstract":"Recent progress in using Long Short-Term Memory (LSTM) for image description has motivated the exploration of their applications for automatically describing video content with natural language sentences. By taking a video as a sequence of features, LSTM model is trained on video-sentence pairs to learn association of a video to a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without considering attention which allows for salient features. Furthermore, most existing approaches model the translating error, but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to transfer videos to natural sentences. This framework integrates attention mechanism with LSTM to capture salient structures of video, and explores the correlation between multi-modal representations for generating sentences with rich semantic content. More specifically, we first propose an attention mechanism which uses the dynamic weighted sum of local 2D Convolutional Neural Network (CNN) and 3D CNN representations. Then, a LSTM decoder takes these visual features at time $t$ and the word-embedding feature at time $t$-$1$ to generate important words. Finally, we uses multi-modal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistence of the sentence description and the video visual content. Experiments on the benchmark datasets demonstrate the superiority of our method than the state-of-the-art baselines for video captioning in both BLEU and METEOR.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"88 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"56","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 24th ACM international conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2964284.2967242","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 56
Abstract
Recent progress in using Long Short-Term Memory (LSTM) networks for image description has motivated the exploration of their application to automatically describing video content with natural language sentences. By treating a video as a sequence of features, an LSTM model is trained on video-sentence pairs to learn the association between a video and a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without an attention mechanism that allows for salient features. Furthermore, most existing approaches model the translation error but ignore the correlation between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, for translating videos into natural sentences. This framework integrates an attention mechanism with LSTM to capture the salient structures of a video, and explores the correlation between multi-modal representations to generate sentences with rich semantic content. More specifically, we first propose an attention mechanism that uses a dynamic weighted sum of local 2D Convolutional Neural Network (CNN) and 3D CNN representations. Then, an LSTM decoder takes these visual features at time $t$ and the word-embedding feature at time $t-1$ to generate important words. Finally, we use a multi-modal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistency between the sentence description and the video's visual content. Experiments on benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines for video captioning in terms of both BLEU and METEOR.
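To make the decoding step described above concrete, the following is a minimal sketch (in PyTorch) of a soft-attention LSTM decoder: a dynamic weighted sum over per-frame 2D-CNN and 3D-CNN features is computed at each time step $t$ and fed to an LSTM cell together with the embedding of the word generated at $t-1$. All class names, layer choices, and dimensions (e.g. `feat_dim`, `hidden_dim`) are illustrative assumptions, not the authors' exact implementation; the multi-modal joint-embedding consistency term mentioned in the abstract would be an additional loss on top of the usual word-prediction cross-entropy.

```python
# Hypothetical sketch of an attention-based LSTM decoding step for video captioning.
# Assumptions: frame features are pre-extracted 2D-CNN + 3D-CNN vectors concatenated
# per frame; dimensions and module names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTMDecoder(nn.Module):
    def __init__(self, feat_dim=1024 + 4096, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # attention score depends on each frame feature and the current decoder state
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, feats, h):
        # feats: (batch, num_frames, feat_dim); h: (batch, hidden_dim)
        e = self.att_score(torch.tanh(self.att_feat(feats) + self.att_hidden(h).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)           # attention weights over frames
        return (alpha * feats).sum(dim=1)     # dynamic weighted sum of frame features

    def step(self, feats, prev_word, h, c):
        # prev_word: indices of the words generated at time t-1
        context = self.attend(feats, h)       # attended visual feature at time t
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.out(h), h, c              # logits over the vocabulary, new LSTM state
```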