{"title":"Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism","authors":"Zhaodong Chen, Yuying Quan, Zheng Qu, L. Liu, Yufei Ding, Yuan Xie","doi":"10.1145/3572848.3577500","DOIUrl":null,"url":null,"abstract":"Transformers are becoming the mainstream solutions for various tasks like NLP and Computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. One opportunity to accelerate the attention mechanism is leveraging the sparsity in the attention weight matrix. However, due to the dilemma between \"dynamic\" and \"fine-grained\", previous studies fail to achieve speedup on GPUs under moderate sequence lengths. They also require costly retraining to recover accuracy. In this paper, we present DFSS, the first GPU-friendly dynamic fine-grained pruning mechanism, to address this dilemma. DFSS dynamically prunes the full attention score matrix to N:M fine-grained structured sparse pattern. Our key insight is that on the dynamic side, N:M sparsity is friendly to pruning and encoding the sparse matrix on GPU. On the fine-grained side, it always preserves the dominant entries in each row. We develop a dynamic sampled dense-dense matrix multiplication kernel, first of its kind, that multiplies the query and key matrices, prunes the result, and encodes the compressed sparse matrix without overhead. Compared with previous studies, DFSS achieves speedup in arbitrary sequence lengths. It only takes a few fine-tuning epochs to reach on-par accuracy with full attention mechanism. We provide both theoretical and empirical evidence to demonstrate DFSS is a good approximation of the full attention mechanism. We evaluate the 1:2 and 2:4 sparsity under different settings and achieve 1.38 ~ 1.86× speedups over the full-attention on A100 GPU. On tasks from various domains with sequence lengths from 384 to 4096, its accuracy is on par with the full attention after only a couple of finetuning epochs from the dense pre-trained model.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3572848.3577500","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Transformers are becoming the mainstream solution for various tasks in NLP and computer vision. Despite their success, the high complexity of the attention mechanism hinders their application to latency-sensitive tasks. One opportunity to accelerate the attention mechanism is to leverage the sparsity in the attention weight matrix. However, due to the dilemma between "dynamic" and "fine-grained", previous studies fail to achieve speedups on GPUs at moderate sequence lengths, and they require costly retraining to recover accuracy. In this paper, we present DFSS, the first GPU-friendly dynamic fine-grained pruning mechanism, to address this dilemma. DFSS dynamically prunes the full attention score matrix to an N:M fine-grained structured sparse pattern. Our key insight is that, on the dynamic side, N:M sparsity is amenable to pruning and encoding the sparse matrix on the GPU; on the fine-grained side, it always preserves the dominant entries in each row. We develop a dynamic sampled dense-dense matrix multiplication (SDDMM) kernel, the first of its kind, that multiplies the query and key matrices, prunes the result, and encodes the compressed sparse matrix without overhead. Compared with previous studies, DFSS achieves speedups at arbitrary sequence lengths and needs only a few fine-tuning epochs to reach accuracy on par with the full attention mechanism. We provide both theoretical and empirical evidence that DFSS is a good approximation of the full attention mechanism. We evaluate 1:2 and 2:4 sparsity under different settings and achieve 1.38× to 1.86× speedups over full attention on an A100 GPU. On tasks from various domains with sequence lengths from 384 to 4096, DFSS matches the accuracy of full attention after only a couple of fine-tuning epochs starting from the dense pre-trained model.
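
As a concrete illustration of the N:M pruning step described in the abstract, the sketch below prunes each row of the dense attention-score matrix to a 2:4 pattern (keeping the 2 largest entries in every group of 4 consecutive scores) before the softmax. This is only an illustrative NumPy re-implementation of the idea, not the paper's fused SDDMM CUDA kernel or its compressed-sparse encoding; the function name, tensor shapes, and group handling are assumptions made for the example.

```python
# Minimal sketch of N:M (here 2:4) dynamic pruning of attention scores.
# Not the DFSS fused GPU kernel; a plain NumPy approximation of the idea.
import numpy as np

def dfss_style_attention(Q, K, V, n=2, m=4):
    """Dense Q@K^T, N:M pruning per row, masked softmax, then @V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # full attention scores, (L, L)
    L = scores.shape[-1]
    assert L % m == 0, "sequence length must be a multiple of M in this sketch"

    # View each row as groups of M entries and keep the N largest per group.
    groups = scores.reshape(scores.shape[0], L // m, m)
    topn = np.argsort(groups, axis=-1)[..., -n:]          # indices of top-N per group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, topn, True, axis=-1)
    mask = mask.reshape(scores.shape)

    # Masked softmax: pruned entries contribute nothing.
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Tiny usage example with random data (hypothetical shapes).
rng = np.random.default_rng(0)
L, d = 8, 16
Q, K, V = rng.standard_normal((3, L, d))
out = dfss_style_attention(Q, K, V)
print(out.shape)  # (8, 16)
```

Because every group of M scores keeps exactly its N largest entries, each row retains its dominant attention weights, which is the "fine-grained" property the paper relies on; the actual DFSS kernel performs the pruning and sparse encoding inside the SDDMM pass rather than materializing the mask as done here.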