Joint Event Detection and Description in Continuous Video Streams

2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Pub Date: 2018-02-28, DOI: 10.1109/WACV.2019.00048

Huijuan Xu, Boyang Albert Li, Vasili Ramanishka, L. Sigal, Kate Saenko

Citations: 48

Abstract

Dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net) for solving this task in an end-to-end fashion, which encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and then uses a two-level hierarchical LSTM module with context modeling to transcribe the event proposals into captions. We show the effectiveness of our proposed JEDDi-Net on the large-scale ActivityNet Captions dataset.
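One step in the abstract, turning variable-length temporal events into pooled features, can be illustrated with a small sketch. The abstract does not specify the pooling operator or its parameters, so everything below is an assumption made for illustration: the function name `pool_segment`, the use of max pooling, the number of bins, and the toy per-time-step feature vectors standing in for 3D convolutional outputs. The point is only that events of different lengths can be mapped to the same fixed-size representation before a captioning module consumes them.

```python
from typing import List

Feature = List[float]


def pool_segment(features: List[Feature], start: int, end: int, bins: int) -> List[Feature]:
    """Max-pool features[start:end] into `bins` temporal bins (hypothetical helper).

    Bins may overlap or repeat when the segment is shorter than `bins`,
    so the output always has exactly `bins` rows.
    """
    if bins < 1:
        raise ValueError("bins must be at least 1")
    if not 0 <= start < end <= len(features):
        raise ValueError("segment must satisfy 0 <= start < end <= len(features)")
    length = end - start
    pooled = []
    for b in range(bins):
        lo = start + (b * length) // bins            # floor of the bin's left edge
        hi = start + -(-(b + 1) * length // bins)    # ceiling of the bin's right edge
        window = features[lo:hi]
        # Element-wise maximum over the time steps that fall in this bin.
        pooled.append([max(column) for column in zip(*window)])
    return pooled


# Toy stand-in for encoder output: 5 time steps, 2 channels each.
features = [[1.0, 0.0], [2.0, 5.0], [3.0, 1.0], [0.0, 4.0], [9.0, 2.0]]

short_event = pool_segment(features, start=1, end=3, bins=2)  # 2 time steps
long_event = pool_segment(features, start=0, end=5, bins=2)   # 5 time steps
```

Both events come back as two rows of two channels even though one spans two time steps and the other five, which is what lets a downstream sequence model treat proposals uniformly.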
