{"title":"基于多尺度时间关注的高效视频变压器轨迹对准","authors":"Zao Zhang, Dong Yuan, Yu Zhang, Wei Bao","doi":"10.1109/ICME55011.2023.00244","DOIUrl":null,"url":null,"abstract":"Although the video transformer gets remarkable accuracy on video recognition tasks, it is hard to be deployed in resource-constrained scenarios due to the high computational cost. A method that dynamically modifies and trains the transformer model, ensuring that the computational cost matches the deployment scenario requirement, would be an effective solution to this challenge. In this paper, we propose a method for modifying large-scale video transformers with trajectory alignment based multi-scaled temporal attention (TAMS) schemes to reduce the computational cost significantly while losing accuracy slightly. In the temporal dimension, we adopt multi-scaled sparsity patterns in hierarchical transformer blocks. In the spatial dimension, we use region selection to force the transformer to focus on high-importance regions while not corrupting the spatial context. Our method reduces up to 40% computational cost of state-of-the-art large-scale video transformers with a slight accuracy drop (~ 7%) on the video recognition task.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Trajectory Alignment based Multi-Scaled Temporal Attention for Efficient Video Transformer\",\"authors\":\"Zao Zhang, Dong Yuan, Yu Zhang, Wei Bao\",\"doi\":\"10.1109/ICME55011.2023.00244\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Although the video transformer gets remarkable accuracy on video recognition tasks, it is hard to be deployed in resource-constrained scenarios due to the high computational cost. A method that dynamically modifies and trains the transformer model, ensuring that the computational cost matches the deployment scenario requirement, would be an effective solution to this challenge. In this paper, we propose a method for modifying large-scale video transformers with trajectory alignment based multi-scaled temporal attention (TAMS) schemes to reduce the computational cost significantly while losing accuracy slightly. In the temporal dimension, we adopt multi-scaled sparsity patterns in hierarchical transformer blocks. In the spatial dimension, we use region selection to force the transformer to focus on high-importance regions while not corrupting the spatial context. 
Our method reduces up to 40% computational cost of state-of-the-art large-scale video transformers with a slight accuracy drop (~ 7%) on the video recognition task.\",\"PeriodicalId\":321830,\"journal\":{\"name\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICME55011.2023.00244\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME55011.2023.00244","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: Although video transformers achieve remarkable accuracy on video recognition tasks, they are difficult to deploy in resource-constrained scenarios because of their high computational cost. A method that dynamically modifies and trains the transformer model, so that its computational cost matches the requirements of the deployment scenario, would be an effective solution to this challenge. In this paper, we propose a method for modifying large-scale video transformers with trajectory-alignment-based multi-scaled temporal attention (TAMS) schemes, reducing the computational cost significantly with only a slight loss of accuracy. In the temporal dimension, we adopt multi-scaled sparsity patterns in hierarchical transformer blocks. In the spatial dimension, we use region selection to force the transformer to focus on high-importance regions without corrupting the spatial context. Our method reduces the computational cost of state-of-the-art large-scale video transformers by up to 40% with a slight accuracy drop (~7%) on the video recognition task.
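The abstract does not give implementation details, but its two ingredients can be sketched. Below is a minimal PyTorch sketch of how a strided multi-scale temporal sparsity pattern and importance-based spatial region selection might look inside one attention block; all names here (SparseTemporalAttention, select_important_regions, the stride-based mask, the feature-norm importance score, keep_ratio) are illustrative assumptions, not the authors' actual TAMS implementation.

```python
# A minimal sketch (not the paper's code): strided temporal attention sparsity
# plus importance-based spatial region selection for a video transformer block.
import torch
import torch.nn as nn


class SparseTemporalAttention(nn.Module):
    """Temporal self-attention over frames, restricted by a strided sparsity mask.

    `stride` sets the temporal scale: stride=1 is dense attention; a larger
    stride lets each frame attend only to frames a multiple of `stride` away.
    Hierarchical blocks would use different strides (the assumed "multi-scaled"
    part: e.g., fine scales in early blocks, coarse scales in late blocks).
    """

    def __init__(self, dim: int, num_heads: int = 8, stride: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B*N, T, C) -- one temporal sequence per spatial location.
        T = x.shape[1]
        # Boolean mask, True = blocked: query frame t may attend to key frame
        # t' only when (t - t') is a multiple of stride. The diagonal is always
        # allowed, so every query attends to at least itself.
        idx = torch.arange(T, device=x.device)
        blocked = ((idx[:, None] - idx[None, :]) % self.stride) != 0
        out, _ = self.attn(x, x, x, attn_mask=blocked)
        return out


def select_important_regions(x: torch.Tensor, keep_ratio: float = 0.5):
    """Pick the top-k spatial tokens per frame by a feature-norm importance proxy.

    x: (B, T, N, C). Returns the selected tokens and their indices, so the
    unselected tokens can be passed through unchanged -- one way to keep the
    surrounding spatial context intact while attention focuses on a subset.
    """
    scores = x.norm(dim=-1)                   # (B, T, N) importance scores
    k = max(1, int(x.shape[2] * keep_ratio))  # tokens kept per frame
    top_idx = scores.topk(k, dim=-1).indices  # (B, T, k)
    gathered = torch.gather(
        x, 2, top_idx.unsqueeze(-1).expand(-1, -1, -1, x.shape[-1]))
    return gathered, top_idx


# Usage sketch: fold spatial locations into the batch so each location forms a
# temporal sequence, apply strided temporal attention, then select regions.
B, T, N, C = 2, 8, 196, 256
tokens = torch.randn(B, T, N, C)
temporal = SparseTemporalAttention(dim=C, num_heads=8, stride=2)
seq = tokens.permute(0, 2, 1, 3).reshape(B * N, T, C)
out = temporal(seq).reshape(B, N, T, C).permute(0, 2, 1, 3)
selected, idx = select_important_regions(out, keep_ratio=0.5)
```

The cost saving in such a scheme comes from two places: the strided mask cuts the number of attended temporal pairs by roughly a factor of `stride`, and region selection shrinks the spatial token set that later blocks process. How the actual TAMS method scores regions and schedules scales across the hierarchy is not specified in the abstract.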