{"title":"基于多尺度时间关注的高效视频变压器轨迹对准","authors":"Zao Zhang, Dong Yuan, Yu Zhang, Wei Bao","doi":"10.1109/ICME55011.2023.00244","DOIUrl":null,"url":null,"abstract":"Although the video transformer gets remarkable accuracy on video recognition tasks, it is hard to be deployed in resource-constrained scenarios due to the high computational cost. A method that dynamically modifies and trains the transformer model, ensuring that the computational cost matches the deployment scenario requirement, would be an effective solution to this challenge. In this paper, we propose a method for modifying large-scale video transformers with trajectory alignment based multi-scaled temporal attention (TAMS) schemes to reduce the computational cost significantly while losing accuracy slightly. In the temporal dimension, we adopt multi-scaled sparsity patterns in hierarchical transformer blocks. In the spatial dimension, we use region selection to force the transformer to focus on high-importance regions while not corrupting the spatial context. Our method reduces up to 40% computational cost of state-of-the-art large-scale video transformers with a slight accuracy drop (~ 7%) on the video recognition task.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Trajectory Alignment based Multi-Scaled Temporal Attention for Efficient Video Transformer\",\"authors\":\"Zao Zhang, Dong Yuan, Yu Zhang, Wei Bao\",\"doi\":\"10.1109/ICME55011.2023.00244\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Although the video transformer gets remarkable accuracy on video recognition tasks, it is hard to be deployed in resource-constrained scenarios due to the high computational cost. A method that dynamically modifies and trains the transformer model, ensuring that the computational cost matches the deployment scenario requirement, would be an effective solution to this challenge. In this paper, we propose a method for modifying large-scale video transformers with trajectory alignment based multi-scaled temporal attention (TAMS) schemes to reduce the computational cost significantly while losing accuracy slightly. In the temporal dimension, we adopt multi-scaled sparsity patterns in hierarchical transformer blocks. In the spatial dimension, we use region selection to force the transformer to focus on high-importance regions while not corrupting the spatial context. 
Our method reduces up to 40% computational cost of state-of-the-art large-scale video transformers with a slight accuracy drop (~ 7%) on the video recognition task.\",\"PeriodicalId\":321830,\"journal\":{\"name\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICME55011.2023.00244\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME55011.2023.00244","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: Although video transformers achieve remarkable accuracy on video recognition tasks, they are difficult to deploy in resource-constrained scenarios because of their high computational cost. A method that dynamically modifies and trains the transformer model, so that its computational cost matches the requirements of the deployment scenario, would be an effective solution to this challenge. In this paper, we propose a method for modifying large-scale video transformers with trajectory-alignment-based multi-scaled temporal attention (TAMS) schemes, reducing the computational cost significantly with only a slight loss of accuracy. In the temporal dimension, we adopt multi-scaled sparsity patterns in hierarchical transformer blocks. In the spatial dimension, we use region selection to force the transformer to focus on high-importance regions without corrupting the spatial context. Our method reduces the computational cost of state-of-the-art large-scale video transformers by up to 40% with a slight accuracy drop (~7%) on the video recognition task.
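The abstract does not give implementation details, but its two ingredients can be sketched. Below is a minimal PyTorch sketch of how a strided multi-scale temporal sparsity pattern and importance-based spatial region selection might look inside one attention block; all names here (SparseTemporalAttention, select_important_regions, the stride-based mask, the feature-norm importance score, keep_ratio) are illustrative assumptions, not the authors' actual TAMS implementation.

```python
# A minimal sketch (not the paper's code): strided temporal attention sparsity
# plus importance-based spatial region selection for a video transformer block.
import torch
import torch.nn as nn


class SparseTemporalAttention(nn.Module):
    """Temporal self-attention over frames, restricted by a strided sparsity mask.

    `stride` sets the temporal scale: stride=1 is dense attention; a larger
    stride lets each frame attend only to frames a multiple of `stride` away.
    Hierarchical blocks would use different strides (the assumed "multi-scaled"
    part: e.g., fine scales in early blocks, coarse scales in late blocks).
    """

    def __init__(self, dim: int, num_heads: int = 8, stride: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B*N, T, C) -- one temporal sequence per spatial location.
        T = x.shape[1]
        # Boolean mask, True = blocked: query frame t may attend to key frame
        # t' only when (t - t') is a multiple of stride. The diagonal is always
        # allowed, so every query attends to at least itself.
        idx = torch.arange(T, device=x.device)
        blocked = ((idx[:, None] - idx[None, :]) % self.stride) != 0
        out, _ = self.attn(x, x, x, attn_mask=blocked)
        return out


def select_important_regions(x: torch.Tensor, keep_ratio: float = 0.5):
    """Pick the top-k spatial tokens per frame by a feature-norm importance proxy.

    x: (B, T, N, C). Returns the selected tokens and their indices, so the
    unselected tokens can be passed through unchanged -- one way to keep the
    surrounding spatial context intact while attention focuses on a subset.
    """
    scores = x.norm(dim=-1)                   # (B, T, N) importance scores
    k = max(1, int(x.shape[2] * keep_ratio))  # tokens kept per frame
    top_idx = scores.topk(k, dim=-1).indices  # (B, T, k)
    gathered = torch.gather(
        x, 2, top_idx.unsqueeze(-1).expand(-1, -1, -1, x.shape[-1]))
    return gathered, top_idx


# Usage sketch: fold spatial locations into the batch so each location forms a
# temporal sequence, apply strided temporal attention, then select regions.
B, T, N, C = 2, 8, 196, 256
tokens = torch.randn(B, T, N, C)
temporal = SparseTemporalAttention(dim=C, num_heads=8, stride=2)
seq = tokens.permute(0, 2, 1, 3).reshape(B * N, T, C)
out = temporal(seq).reshape(B, N, T, C).permute(0, 2, 1, 3)
selected, idx = select_important_regions(out, keep_ratio=0.5)
```

The cost saving in such a scheme comes from two places: the strided mask cuts the number of attended temporal pairs by roughly a factor of `stride`, and region selection shrinks the spatial token set that later blocks process. How the actual TAMS method scores regions and schedules scales across the hierarchy is not specified in the abstract.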