Trajectory Alignment based Multi-Scaled Temporal Attention for Efficient Video Transformer

Zao Zhang, Dong Yuan, Yu Zhang, Wei Bao
{"title":"Trajectory Alignment based Multi-Scaled Temporal Attention for Efficient Video Transformer","authors":"Zao Zhang, Dong Yuan, Yu Zhang, Wei Bao","doi":"10.1109/ICME55011.2023.00244","DOIUrl":null,"url":null,"abstract":"Although the video transformer gets remarkable accuracy on video recognition tasks, it is hard to be deployed in resource-constrained scenarios due to the high computational cost. A method that dynamically modifies and trains the transformer model, ensuring that the computational cost matches the deployment scenario requirement, would be an effective solution to this challenge. In this paper, we propose a method for modifying large-scale video transformers with trajectory alignment based multi-scaled temporal attention (TAMS) schemes to reduce the computational cost significantly while losing accuracy slightly. In the temporal dimension, we adopt multi-scaled sparsity patterns in hierarchical transformer blocks. In the spatial dimension, we use region selection to force the transformer to focus on high-importance regions while not corrupting the spatial context. Our method reduces up to 40% computational cost of state-of-the-art large-scale video transformers with a slight accuracy drop (~ 7%) on the video recognition task.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME55011.2023.00244","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Although video transformers achieve remarkable accuracy on video recognition tasks, their high computational cost makes them difficult to deploy in resource-constrained scenarios. A method that dynamically modifies and trains the transformer model so that its computational cost matches the requirements of the deployment scenario would be an effective solution to this challenge. In this paper, we propose a method for modifying large-scale video transformers with trajectory alignment based multi-scaled temporal attention (TAMS) schemes, reducing computational cost significantly at only a slight cost in accuracy. In the temporal dimension, we adopt multi-scaled sparsity patterns in hierarchical transformer blocks. In the spatial dimension, we use region selection to force the transformer to focus on high-importance regions without corrupting the spatial context. Our method reduces the computational cost of state-of-the-art large-scale video transformers by up to 40% with a slight accuracy drop (~7%) on the video recognition task.
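
Since the paper itself is not reproduced on this page, the following is only a minimal PyTorch sketch of the two mechanisms the abstract describes: temporal attention over a strided (sparse) subset of frames, and spatial region selection that keeps only high-importance tokens. Everything in it, from the module and function names (`SparseTemporalAttention`, `select_regions`, `stride`, `keep_ratio`) to the norm-based importance score, is an assumption for illustration, not the authors' TAMS implementation.

```python
# Illustrative sketch, NOT the authors' code: (1) temporal attention whose
# keys/values are subsampled every `stride` frames (coarser blocks in a
# hierarchy would use a larger stride, giving a "multi-scaled" sparsity
# pattern), and (2) spatial region selection keeping high-importance tokens.
import torch
import torch.nn as nn


def select_regions(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top-`keep_ratio` fraction of spatial tokens, scored here by
    feature L2 norm as a stand-in for a learned importance measure.
    tokens: (B, N, C) -> (B, int(N * keep_ratio), C), spatial order kept."""
    num_keep = max(1, int(tokens.shape[1] * keep_ratio))
    scores = tokens.norm(dim=-1)                    # (B, N) importance proxy
    idx = scores.topk(num_keep, dim=1).indices      # indices of kept tokens
    idx = idx.sort(dim=1).values                    # preserve spatial order
    return torch.gather(
        tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    )


class SparseTemporalAttention(nn.Module):
    """Temporal self-attention where keys/values are subsampled every
    `stride` frames; stacking blocks with growing strides yields a
    hierarchical, multi-scaled temporal sparsity pattern."""

    def __init__(self, dim: int, heads: int, stride: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) -- tokens for one spatial location across T frames.
        kv = x[:, :: self.stride]        # sparse keys/values along time
        out, _ = self.attn(x, kv, kv)    # queries stay dense
        return x + out                   # residual connection


if __name__ == "__main__":
    x = torch.randn(2, 16, 192)          # 2 clips, 16 frames, 192-dim tokens
    y = SparseTemporalAttention(dim=192, heads=4, stride=4)(x)
    print(y.shape)                                   # torch.Size([2, 16, 192])
    print(select_regions(x, keep_ratio=0.5).shape)   # torch.Size([2, 8, 192])
```

Subsampling keys and values cuts the temporal attention cost roughly by the stride factor, and dropping low-importance spatial tokens shrinks the token count before attention; combining both at different scales is the general flavor of saving the abstract's up-to-40% figure refers to.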