Video description method with fusion of instance-aware temporal features

Ju Huang, He Yan, Lingkun Liu, Yuhan Liu
DOI: 10.1117/12.3000765 (https://doi.org/10.1117/12.3000765)
Published in: International Conference on Image Processing and Intelligent Control, 2023-08-09

Abstract

Video understanding still poses challenges, especially in describing the visual content of a video with natural language. Existing video encoder-decoder models struggle to extract deep semantic information and to capture the complex contextual semantics of a video sequence. Moreover, different visual elements in a video contribute unequally to the generated text description. In this paper, we propose a video description method that fuses instance-aware temporal features. We extract local instance features along the temporal sequence to enhance the perception of temporal instances, and we apply spatial attention to perform a weighted fusion of the temporal features. Finally, a bidirectional long short-term memory (BiLSTM) network encodes the contextual semantic information of the video sequence, helping to generate higher-quality descriptive text. Experimental results on two public datasets demonstrate that our method performs well on various evaluation metrics.
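The abstract describes a pipeline of three stages: per-frame instance features, spatial-attention weighted fusion across instances, and BiLSTM encoding over time. A minimal sketch of that fusion-plus-encoding step is below; the module name, dimensions, and layer choices are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch (not the authors' implementation): spatial attention scores
# each instance within a frame, the scores weight a fusion of instance
# features, and a BiLSTM then encodes the fused frame sequence.
import torch
import torch.nn as nn


class InstanceTemporalFusion(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # One scalar score per instance feature; softmax over instances
        # turns the scores into spatial attention weights.
        self.attn = nn.Linear(feat_dim, 1)
        # Bidirectional LSTM encodes contextual semantics along time.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, inst_feats: torch.Tensor) -> torch.Tensor:
        # inst_feats: (batch, time, num_instances, feat_dim)
        scores = self.attn(inst_feats)              # (B, T, N, 1)
        weights = torch.softmax(scores, dim=2)      # attention over instances
        fused = (weights * inst_feats).sum(dim=2)   # (B, T, feat_dim)
        encoded, _ = self.bilstm(fused)             # (B, T, 2 * hidden_dim)
        return encoded


if __name__ == "__main__":
    model = InstanceTemporalFusion()
    clips = torch.randn(2, 8, 5, 512)  # 2 clips, 8 frames, 5 instances/frame
    out = model(clips)
    print(tuple(out.shape))
```

The BiLSTM output doubles the hidden size (forward and backward passes concatenated), so a downstream caption decoder would attend over `(B, T, 512)` context vectors in this configuration.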