{"title":"L-STAP: Learned Spatio-Temporal Adaptive Pooling for Video Captioning","authors":"Danny Francis, B. Huet","doi":"10.1145/3347449.3357484","DOIUrl":null,"url":null,"abstract":"Automatic video captioning can be used to enrich TV programs with textual informations on scenes. These informations can be useful for visually impaired people, but can also be used to enhance indexing and research of TV records. Video captioning can be seen as being more challenging than image captioning. In both cases, we have to tackle a challenging task where a visual object has to be analyzed, and translated into a textual description in natural language. However, analyzing videos requires not only to parse still images, but also to draw correspondences through time. Recent works in video captioning have intended to deal with these issues by separating spatial and temporal analysis of videos. In this paper, we propose a Learned Spatio-Temporal Adaptive Pooling (L-STAP) method that combines spatial and temporal analysis. More specifically, we first process a video frame-by-frame through a Convolutional Neural Network. Then, instead of applying an average pooling operation to reduce dimensionality, we apply our L-STAP, which attends to specific regions in a given frame based on what appeared in previous frames. Experiments on MSVD and MSR-VTT datasets show that our method outperforms state-of-the-art methods on the video captioning task in terms of several evaluation metrics.","PeriodicalId":276496,"journal":{"name":"Proceedings of the 1st International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st International Workshop on AI for Smart TV Content Production, Access and Delivery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3347449.3357484","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
Automatic video captioning can be used to enrich TV programs with textual information about scenes. This information can be useful for visually impaired people, and can also enhance the indexing and retrieval of TV records. Video captioning can be seen as more challenging than image captioning. In both cases, we have to tackle a challenging task in which a visual object must be analyzed and translated into a textual description in natural language. However, analyzing videos requires not only parsing still images, but also drawing correspondences through time. Recent works in video captioning have attempted to deal with these issues by separating the spatial and temporal analysis of videos. In this paper, we propose a Learned Spatio-Temporal Adaptive Pooling (L-STAP) method that combines spatial and temporal analysis. More specifically, we first process a video frame by frame through a Convolutional Neural Network. Then, instead of applying an average pooling operation to reduce dimensionality, we apply our L-STAP, which attends to specific regions in a given frame based on what appeared in previous frames. Experiments on the MSVD and MSR-VTT datasets show that our method outperforms state-of-the-art methods on the video captioning task in terms of several evaluation metrics.
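The pooling step described in the abstract can be illustrated with a minimal PyTorch sketch: per-frame CNN feature maps are pooled with attention weights computed from a recurrent state that summarizes previous frames, rather than with plain spatial average pooling. The class name `LSTAPSketch`, the LSTM cell, the single-layer attention scorer, and all layer sizes are illustrative assumptions, not the architecture specified in the paper.

```python
import torch
import torch.nn as nn


class LSTAPSketch(nn.Module):
    """Minimal sketch of learned spatio-temporal adaptive pooling.

    Per-frame CNN feature maps of shape (B, T, H, W, C) are reduced to one
    vector per frame using attention weights conditioned on a recurrent
    state carrying information from previous frames. Dimensions and the
    exact attention form are assumptions for illustration only.
    """

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Scores each spatial location from its feature and the temporal state.
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)
        # Recurrent cell that accumulates what appeared in previous frames.
        self.rnn = nn.LSTMCell(feat_dim, hidden_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, H, W, C) frame-wise CNN feature maps
        B, T, H, W, C = feats.shape
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        pooled = []
        for t in range(T):
            regions = feats[:, t].reshape(B, H * W, C)            # (B, HW, C)
            state = h.unsqueeze(1).expand(-1, H * W, -1)          # (B, HW, hidden)
            scores = self.attn(torch.cat([regions, state], -1))   # (B, HW, 1)
            alpha = torch.softmax(scores, dim=1)                  # attention over regions
            frame_vec = (alpha * regions).sum(dim=1)              # adaptively pooled frame (B, C)
            pooled.append(frame_vec)
            h, c = self.rnn(frame_vec, (h, c))                    # update temporal context
        return torch.stack(pooled, dim=1)                         # (B, T, C)


if __name__ == "__main__":
    model = LSTAPSketch()
    # e.g. 7x7 conv feature maps for 8 sampled frames of 2 videos
    dummy = torch.randn(2, 8, 7, 7, 2048)
    print(model(dummy).shape)  # torch.Size([2, 8, 2048])
```

The resulting per-frame vectors would then feed a caption decoder, exactly where globally average-pooled features would otherwise be used; the only change sketched here is that the spatial pooling weights depend on the temporal context.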