{"title":"RTSA: A Run-Through Sparse Attention Framework for Video Transformer","authors":"Xuhang Wang;Zhuoran Song;Chunyu Qi;Fangxin Liu;Naifeng Jing;Li Jiang;Xiaoyao Liang","doi":"10.1109/TC.2025.3547139","DOIUrl":null,"url":null,"abstract":"In the realm of video understanding tasks, Video Transformer models (VidT) have recently exhibited impressive accuracy improvements in numerous edge devices. However, their deployment poses significant computational challenges for hardware. To address this, pruning has emerged as a promising approach to reduce computation and memory requirements by eliminating unimportant elements from the attention matrix. Unfortunately, existing pruning algorithms face a limitation in that they only optimize one of the two key modules on VidT's critical path: linear projection or self-attention. Regrettably, due to the variation in battery power in edge devices, the video resolution they generate will also change, which causes both linear projection and self-attention stages to potentially become bottlenecks, the existing approaches lack generality. Accordingly, we establish a Run-Through Sparse Attention (RTSA) framework that simultaneously sparsifies and accelerates two stages. On the algorithm side, unlike current methodologies conducting sparse linear projection by exploring redundancy within each frame, we extract extra redundancy naturally existing between frames. Moreover, for sparse self-attention, as existing pruning algorithms often provide either too coarse-grained or fine-grained sparsity patterns, these algorithms face limitations in simultaneously achieving high sparsity, low accuracy loss, and high speedup, resulting in either compromised accuracy or reduced efficiency. Thus, we prune the attention matrix at a medium granularity—sub-vector. The sub-vectors are generated by isolating each column of the attention matrix. On the hardware side, we observe that the use of distinct computational units for sparse linear projection and self-attention results in pipeline imbalances because of the bottleneck transformation between the two stages. To effectively eliminate pipeline stall, we design a RTSA architecture that supports sequential execution of both sparse linear projection and self-attention. To achieve this, we devised an atomic vector-scalar product computation underpinning all calculations in parse linear projection and self-attention, as well as evolving a spatial array architecture with augmented processing elements (PEs) tailored for the vector-scalar product. 
Experiments on VidT models show that RTSA can save 2.71<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> to 5.32<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> ideal computation with <inline-formula><tex-math>$ \\lt 1\\%$</tex-math></inline-formula> accuracy loss, achieving 105<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula>, 56.8<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula>, 3.59<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula>, and 3.31<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> speedup compared to CPU, GPU, as well as the state-of-the-art ViT accelerators ViTCoD and HeatViT.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 6","pages":"1949-1962"},"PeriodicalIF":3.6000,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10909307/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0
Abstract
In the realm of video understanding tasks, Video Transformer models (VidT) have recently exhibited impressive accuracy improvements on numerous edge devices. However, their deployment poses significant computational challenges for hardware. To address this, pruning has emerged as a promising approach to reduce computation and memory requirements by eliminating unimportant elements from the attention matrix. Unfortunately, existing pruning algorithms optimize only one of the two key modules on VidT's critical path: linear projection or self-attention. Because the battery power of edge devices varies, the resolution of the video they capture also changes, which causes either the linear projection or the self-attention stage to become the bottleneck; the existing approaches therefore lack generality. Accordingly, we establish a Run-Through Sparse Attention (RTSA) framework that simultaneously sparsifies and accelerates both stages. On the algorithm side, unlike current methodologies that conduct sparse linear projection by exploiting redundancy within each frame, we extract the extra redundancy that naturally exists between frames. Moreover, for sparse self-attention, existing pruning algorithms provide sparsity patterns that are either too coarse-grained or too fine-grained, so they cannot simultaneously achieve high sparsity, low accuracy loss, and high speedup, sacrificing either accuracy or efficiency. We therefore prune the attention matrix at a medium granularity, the sub-vector, where the sub-vectors are generated by isolating each column of the attention matrix. On the hardware side, we observe that using distinct computational units for sparse linear projection and self-attention results in pipeline imbalance, because the bottleneck shifts between the two stages. To eliminate pipeline stalls, we design an RTSA architecture that supports sequential execution of both sparse linear projection and self-attention. To achieve this, we devise an atomic vector-scalar product computation that underpins all calculations in sparse linear projection and self-attention, and develop a spatial array architecture with augmented processing elements (PEs) tailored to the vector-scalar product. Experiments on VidT models show that RTSA saves $2.71\times$ to $5.32\times$ ideal computation with $<1\%$ accuracy loss, achieving $105\times$, $56.8\times$, $3.59\times$, and $3.31\times$ speedups over a CPU, a GPU, and the state-of-the-art ViT accelerators ViTCoD and HeatViT, respectively.
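To make the inter-frame redundancy idea concrete, below is a minimal NumPy sketch of one plausible reading: a token's linear projection is recomputed only when the token differs enough from the same token in the previous frame, and the previous result is reused otherwise. The function name `select_active_tokens`, the L2 distance metric, and the threshold `tau` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def select_active_tokens(frames, tau=0.05):
    """Flag tokens that changed enough since the previous frame.

    frames: (T, N, D) array of T frames, each with N patch tokens of
    dimension D. A token is marked active (its projection must be
    recomputed) only if its L2 distance to the same token in the
    previous frame exceeds tau; otherwise the previous frame's
    projection result can be reused.
    """
    T, N, D = frames.shape
    active = np.ones((T, N), dtype=bool)   # frame 0 has no predecessor
    diff = np.linalg.norm(frames[1:] - frames[:-1], axis=-1)  # (T-1, N)
    active[1:] = diff > tau
    return active

# Example: 8 frames of 14x14 = 196 patch tokens with 768-d embeddings.
frames = np.random.rand(8, 196, 768)
frames[3] = frames[2]                # a fully static frame
mask = select_active_tokens(frames)
print(mask.sum(axis=1))              # active tokens per frame; frame 3 -> 0
```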
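As a rough illustration of sub-vector pruning, the sketch below splits each column of an attention matrix into contiguous sub-vectors of length `group` and zeroes out those with the smallest L1 mass. The fixed-length column segments and the L1 importance score are assumptions chosen for clarity; the paper's actual grouping and scoring criterion may differ.

```python
import numpy as np

def subvector_prune(attn, group=8, keep_ratio=0.5):
    """Prune an attention matrix at sub-vector granularity.

    Each column of `attn` is split into contiguous sub-vectors of
    length `group`; per column, only the `keep_ratio` fraction of
    sub-vectors with the largest L1 mass survive. The L1 score is an
    illustrative choice, not necessarily RTSA's criterion.
    """
    n_rows, n_cols = attn.shape
    assert n_rows % group == 0, "rows must be divisible by sub-vector length"
    # View each column as a stack of sub-vectors: (n_sub, group, n_cols).
    sub = attn.reshape(n_rows // group, group, n_cols)
    score = np.abs(sub).sum(axis=1)               # (n_sub, n_cols) importance
    k = max(1, int(keep_ratio * score.shape[0]))  # sub-vectors kept per column
    keep = np.argsort(score, axis=0)[-k:, :]      # top-k per column
    mask = np.zeros_like(score, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=0)
    sub = sub * mask[:, None, :]                  # broadcast over the group dim
    return sub.reshape(n_rows, n_cols)

# Example: 16x16 attention map, sub-vectors of length 4, keep half.
attn = np.random.rand(16, 16)
sparse_attn = subvector_prune(attn, group=4, keep_ratio=0.5)
```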
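The atomic vector-scalar product can be understood through the following behavioral sketch: a matrix product C = A @ B is accumulated one output column at a time as a sum of vector-scalar products, C[:, j] += A[:, k] * B[k, j], and pruned (zero) scalars are simply skipped. The decomposition itself is standard; the skip logic is an assumption about how RTSA's PE array might exploit sparsity, since both linear projection and self-attention reduce to this same operation.

```python
import numpy as np

def matmul_vector_scalar(A, B):
    """Matrix product expressed as a sum of vector-scalar products.

    For each output column j, C[:, j] = sum_k A[:, k] * B[k, j].
    Each term is one vector-scalar product, the atomic operation the
    RTSA datapath is built around; zero scalars (pruned entries) are
    skipped, so one loop serves both sparse linear projection and
    sparse self-attention.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for j in range(n):                  # one output column at a time
        for t in range(k):
            s = B[t, j]                 # scalar operand
            if s != 0.0:                # sparsity: skip pruned scalars
                C[:, j] += A[:, t] * s  # vector-scalar product + accumulate
    return C

A = np.random.rand(4, 6)
B = np.random.rand(6, 5)
B[np.random.rand(6, 5) < 0.5] = 0.0     # emulate pruning
assert np.allclose(matmul_vector_scalar(A, B), A @ B)
```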
Journal Introduction:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.