Survey: Transformer based video-language pre-training

Impact factor: 14.8

AI Open. Publication date: 2022-01-01. DOI: 10.1016/j.aiopen.2022.01.001

Ludan Ruan, Qin Jin

Citations: 24

Abstract

Inspired by the success of transformer-based pre-training methods on natural language tasks and, subsequently, computer vision tasks, researchers have started to apply transformers to video processing. This survey aims to provide a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as background knowledge, including the attention mechanism, position encoding, etc. We then describe the typical pre-training and fine-tuning paradigm for Video-Language processing in terms of proxy tasks, downstream tasks, and commonly used video datasets. Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations, and compare their performance. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.




Source journal

AI Open

CiteScore

45.00

Self-citation rate

0.00%

Number of publications