基于Transformer的视频语言预训练

Ludan Ruan, Qin Jin
{"title":"基于Transformer的视频语言预训练","authors":"Ludan Ruan,&nbsp;Qin Jin","doi":"10.1016/j.aiopen.2022.01.001","DOIUrl":null,"url":null,"abstract":"<div><p>Inspired by the success of transformer-based pre-training methods on natural language tasks and further computer vision tasks, researchers have started to apply transformer to video processing. This survey aims to provide a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as the background knowledge, including attention mechanism, position encoding etc. We then describe the typical paradigm of pre-training &amp; fine-tuning on Video-Language processing in terms of proxy tasks, downstream tasks and commonly used video datasets. Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations and compare their performances. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.</p></div>","PeriodicalId":100068,"journal":{"name":"AI Open","volume":"3 ","pages":"Pages 1-13"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666651022000018/pdfft?md5=d7b4ae16eb4b58434223ebe8ccf64030&pid=1-s2.0-S2666651022000018-main.pdf","citationCount":"24","resultStr":"{\"title\":\"Survey: Transformer based video-language pre-training\",\"authors\":\"Ludan Ruan,&nbsp;Qin Jin\",\"doi\":\"10.1016/j.aiopen.2022.01.001\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Inspired by the success of transformer-based pre-training methods on natural language tasks and further computer vision tasks, researchers have started to apply transformer to video processing. This survey aims to provide a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as the background knowledge, including attention mechanism, position encoding etc. We then describe the typical paradigm of pre-training &amp; fine-tuning on Video-Language processing in terms of proxy tasks, downstream tasks and commonly used video datasets. Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations and compare their performances. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.</p></div>\",\"PeriodicalId\":100068,\"journal\":{\"name\":\"AI Open\",\"volume\":\"3 \",\"pages\":\"Pages 1-13\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2666651022000018/pdfft?md5=d7b4ae16eb4b58434223ebe8ccf64030&pid=1-s2.0-S2666651022000018-main.pdf\",\"citationCount\":\"24\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AI Open\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666651022000018\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AI Open","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666651022000018","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 24

摘要

受基于变压器的预训练方法在自然语言任务和进一步的计算机视觉任务上的成功启发,研究人员开始将变压器应用于视频处理。这项调查的目的是提供一个全面的概述基于变换的预训练方法的视频语言学习。首先简要介绍变压器的结构作为背景知识,包括注意机制、位置编码等。然后,我们描述了预训练的典型范例。在代理任务、下游任务和常用视频数据集方面对视频语言处理进行微调。接下来,我们将变压器模型分为单流和多流结构,重点介绍了它们的创新之处并比较了它们的性能。最后,我们分析和讨论了视频语言预训练目前面临的挑战和未来可能的研究方向。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Survey: Transformer based video-language pre-training

Inspired by the success of transformer-based pre-training methods on natural language tasks and further computer vision tasks, researchers have started to apply transformer to video processing. This survey aims to provide a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as the background knowledge, including attention mechanism, position encoding etc. We then describe the typical paradigm of pre-training & fine-tuning on Video-Language processing in terms of proxy tasks, downstream tasks and commonly used video datasets. Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations and compare their performances. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
45.00
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信