{"title":"RTSA: A Run-Through Sparse Attention Framework for Video Transformer","authors":"Xuhang Wang;Zhuoran Song;Chunyu Qi;Fangxin Liu;Naifeng Jing;Li Jiang;Xiaoyao Liang","doi":"10.1109/TC.2025.3547139","DOIUrl":null,"url":null,"abstract":"In the realm of video understanding tasks, Video Transformer models (VidT) have recently exhibited impressive accuracy improvements in numerous edge devices. However, their deployment poses significant computational challenges for hardware. To address this, pruning has emerged as a promising approach to reduce computation and memory requirements by eliminating unimportant elements from the attention matrix. Unfortunately, existing pruning algorithms face a limitation in that they only optimize one of the two key modules on VidT's critical path: linear projection or self-attention. Regrettably, due to the variation in battery power in edge devices, the video resolution they generate will also change, which causes both linear projection and self-attention stages to potentially become bottlenecks, the existing approaches lack generality. Accordingly, we establish a Run-Through Sparse Attention (RTSA) framework that simultaneously sparsifies and accelerates two stages. On the algorithm side, unlike current methodologies conducting sparse linear projection by exploring redundancy within each frame, we extract extra redundancy naturally existing between frames. Moreover, for sparse self-attention, as existing pruning algorithms often provide either too coarse-grained or fine-grained sparsity patterns, these algorithms face limitations in simultaneously achieving high sparsity, low accuracy loss, and high speedup, resulting in either compromised accuracy or reduced efficiency. Thus, we prune the attention matrix at a medium granularity—sub-vector. The sub-vectors are generated by isolating each column of the attention matrix. On the hardware side, we observe that the use of distinct computational units for sparse linear projection and self-attention results in pipeline imbalances because of the bottleneck transformation between the two stages. To effectively eliminate pipeline stall, we design a RTSA architecture that supports sequential execution of both sparse linear projection and self-attention. To achieve this, we devised an atomic vector-scalar product computation underpinning all calculations in parse linear projection and self-attention, as well as evolving a spatial array architecture with augmented processing elements (PEs) tailored for the vector-scalar product. 
Experiments on VidT models show that RTSA can save 2.71<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> to 5.32<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> ideal computation with <inline-formula><tex-math>$ \\lt 1\\%$</tex-math></inline-formula> accuracy loss, achieving 105<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula>, 56.8<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula>, 3.59<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula>, and 3.31<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> speedup compared to CPU, GPU, as well as the state-of-the-art ViT accelerators ViTCoD and HeatViT.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 6","pages":"1949-1962"},"PeriodicalIF":3.6000,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10909307/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0
Abstract
In the realm of video understanding tasks, Video Transformer models (VidT) have recently exhibited impressive accuracy improvements on numerous edge devices. However, their deployment poses significant computational challenges for hardware. To address this, pruning has emerged as a promising approach to reduce computation and memory requirements by eliminating unimportant elements from the attention matrix. Unfortunately, existing pruning algorithms optimize only one of the two key modules on VidT's critical path: linear projection or self-attention. Because the battery power of edge devices varies, the resolution of the video they capture also changes, which causes either the linear projection or the self-attention stage to become the bottleneck; the existing approaches therefore lack generality. Accordingly, we establish a Run-Through Sparse Attention (RTSA) framework that simultaneously sparsifies and accelerates both stages. On the algorithm side, unlike current methodologies that conduct sparse linear projection by exploiting redundancy within each frame, we extract the extra redundancy that naturally exists between frames. Moreover, for sparse self-attention, existing pruning algorithms provide sparsity patterns that are either too coarse-grained or too fine-grained, so they cannot simultaneously achieve high sparsity, low accuracy loss, and high speedup, sacrificing either accuracy or efficiency. We therefore prune the attention matrix at a medium granularity, the sub-vector, where the sub-vectors are generated by isolating each column of the attention matrix. On the hardware side, we observe that using distinct computational units for sparse linear projection and self-attention results in pipeline imbalance, because the bottleneck shifts between the two stages. To eliminate pipeline stalls, we design an RTSA architecture that supports sequential execution of both sparse linear projection and self-attention. To achieve this, we devise an atomic vector-scalar product computation that underpins all calculations in sparse linear projection and self-attention, and develop a spatial array architecture with augmented processing elements (PEs) tailored to the vector-scalar product. Experiments on VidT models show that RTSA saves $2.71\times$ to $5.32\times$ ideal computation with $<1\%$ accuracy loss, achieving $105\times$, $56.8\times$, $3.59\times$, and $3.31\times$ speedups over a CPU, a GPU, and the state-of-the-art ViT accelerators ViTCoD and HeatViT, respectively.
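To make the inter-frame redundancy idea concrete, below is a minimal NumPy sketch of one plausible reading: a token's linear projection is recomputed only when the token differs enough from the same token in the previous frame, and the previous result is reused otherwise. The function name `select_active_tokens`, the L2 distance metric, and the threshold `tau` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def select_active_tokens(frames, tau=0.05):
    """Flag tokens that changed enough since the previous frame.

    frames: (T, N, D) array of T frames, each with N patch tokens of
    dimension D. A token is marked active (its projection must be
    recomputed) only if its L2 distance to the same token in the
    previous frame exceeds tau; otherwise the previous frame's
    projection result can be reused.
    """
    T, N, D = frames.shape
    active = np.ones((T, N), dtype=bool)   # frame 0 has no predecessor
    diff = np.linalg.norm(frames[1:] - frames[:-1], axis=-1)  # (T-1, N)
    active[1:] = diff > tau
    return active

# Example: 8 frames of 14x14 = 196 patch tokens with 768-d embeddings.
frames = np.random.rand(8, 196, 768)
frames[3] = frames[2]                # a fully static frame
mask = select_active_tokens(frames)
print(mask.sum(axis=1))              # active tokens per frame; frame 3 -> 0
```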
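As a rough illustration of sub-vector pruning, the sketch below splits each column of an attention matrix into contiguous sub-vectors of length `group` and zeroes out those with the smallest L1 mass. The fixed-length column segments and the L1 importance score are assumptions chosen for clarity; the paper's actual grouping and scoring criterion may differ.

```python
import numpy as np

def subvector_prune(attn, group=8, keep_ratio=0.5):
    """Prune an attention matrix at sub-vector granularity.

    Each column of `attn` is split into contiguous sub-vectors of
    length `group`; per column, only the `keep_ratio` fraction of
    sub-vectors with the largest L1 mass survive. The L1 score is an
    illustrative choice, not necessarily RTSA's criterion.
    """
    n_rows, n_cols = attn.shape
    assert n_rows % group == 0, "rows must be divisible by sub-vector length"
    # View each column as a stack of sub-vectors: (n_sub, group, n_cols).
    sub = attn.reshape(n_rows // group, group, n_cols)
    score = np.abs(sub).sum(axis=1)               # (n_sub, n_cols) importance
    k = max(1, int(keep_ratio * score.shape[0]))  # sub-vectors kept per column
    keep = np.argsort(score, axis=0)[-k:, :]      # top-k per column
    mask = np.zeros_like(score, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=0)
    sub = sub * mask[:, None, :]                  # broadcast over the group dim
    return sub.reshape(n_rows, n_cols)

# Example: 16x16 attention map, sub-vectors of length 4, keep half.
attn = np.random.rand(16, 16)
sparse_attn = subvector_prune(attn, group=4, keep_ratio=0.5)
```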
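The atomic vector-scalar product can be understood through the following behavioral sketch: a matrix product C = A @ B is accumulated one output column at a time as a sum of vector-scalar products, C[:, j] += A[:, k] * B[k, j], and pruned (zero) scalars are simply skipped. The decomposition itself is standard; the skip logic is an assumption about how RTSA's PE array might exploit sparsity, since both linear projection and self-attention reduce to this same operation.

```python
import numpy as np

def matmul_vector_scalar(A, B):
    """Matrix product expressed as a sum of vector-scalar products.

    For each output column j, C[:, j] = sum_k A[:, k] * B[k, j].
    Each term is one vector-scalar product, the atomic operation the
    RTSA datapath is built around; zero scalars (pruned entries) are
    skipped, so one loop serves both sparse linear projection and
    sparse self-attention.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for j in range(n):                  # one output column at a time
        for t in range(k):
            s = B[t, j]                 # scalar operand
            if s != 0.0:                # sparsity: skip pruned scalars
                C[:, j] += A[:, t] * s  # vector-scalar product + accumulate
    return C

A = np.random.rand(4, 6)
B = np.random.rand(6, 5)
B[np.random.rand(6, 5) < 0.5] = 0.0     # emulate pruning
assert np.allclose(matmul_vector_scalar(A, B), A @ B)
```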
Journal Introduction:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.