{"title":"Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism","authors":"Zhaodong Chen, Yuying Quan, Zheng Qu, L. Liu, Yufei Ding, Yuan Xie","doi":"10.1145/3572848.3577500","DOIUrl":null,"url":null,"abstract":"Transformers are becoming the mainstream solutions for various tasks like NLP and Computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. One opportunity to accelerate the attention mechanism is leveraging the sparsity in the attention weight matrix. However, due to the dilemma between \"dynamic\" and \"fine-grained\", previous studies fail to achieve speedup on GPUs under moderate sequence lengths. They also require costly retraining to recover accuracy. In this paper, we present DFSS, the first GPU-friendly dynamic fine-grained pruning mechanism, to address this dilemma. DFSS dynamically prunes the full attention score matrix to N:M fine-grained structured sparse pattern. Our key insight is that on the dynamic side, N:M sparsity is friendly to pruning and encoding the sparse matrix on GPU. On the fine-grained side, it always preserves the dominant entries in each row. We develop a dynamic sampled dense-dense matrix multiplication kernel, first of its kind, that multiplies the query and key matrices, prunes the result, and encodes the compressed sparse matrix without overhead. Compared with previous studies, DFSS achieves speedup in arbitrary sequence lengths. It only takes a few fine-tuning epochs to reach on-par accuracy with full attention mechanism. We provide both theoretical and empirical evidence to demonstrate DFSS is a good approximation of the full attention mechanism. We evaluate the 1:2 and 2:4 sparsity under different settings and achieve 1.38 ~ 1.86× speedups over the full-attention on A100 GPU. On tasks from various domains with sequence lengths from 384 to 4096, its accuracy is on par with the full attention after only a couple of finetuning epochs from the dense pre-trained model.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3572848.3577500","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Transformers are becoming the mainstream solution for various tasks in NLP and computer vision. Despite their success, the high complexity of the attention mechanism hinders their application to latency-sensitive tasks. One opportunity to accelerate the attention mechanism is to leverage the sparsity in the attention weight matrix. However, due to the dilemma between "dynamic" and "fine-grained", previous studies fail to achieve speedups on GPUs at moderate sequence lengths, and they require costly retraining to recover accuracy. In this paper, we present DFSS, the first GPU-friendly dynamic fine-grained pruning mechanism, to address this dilemma. DFSS dynamically prunes the full attention score matrix to an N:M fine-grained structured sparse pattern. Our key insight is that, on the dynamic side, N:M sparsity is amenable to pruning and encoding the sparse matrix on the GPU; on the fine-grained side, it always preserves the dominant entries in each row. We develop a dynamic sampled dense-dense matrix multiplication (SDDMM) kernel, the first of its kind, that multiplies the query and key matrices, prunes the result, and encodes the compressed sparse matrix without overhead. Compared with previous studies, DFSS achieves speedups at arbitrary sequence lengths and needs only a few fine-tuning epochs to reach accuracy on par with the full attention mechanism. We provide both theoretical and empirical evidence that DFSS is a good approximation of the full attention mechanism. We evaluate 1:2 and 2:4 sparsity under different settings and achieve 1.38× to 1.86× speedups over full attention on an A100 GPU. On tasks from various domains with sequence lengths from 384 to 4096, DFSS matches the accuracy of full attention after only a couple of fine-tuning epochs starting from the dense pre-trained model.
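
As a concrete illustration of the N:M pruning step described in the abstract, the sketch below prunes each row of the dense attention-score matrix to a 2:4 pattern (keeping the 2 largest entries in every group of 4 consecutive scores) before the softmax. This is only an illustrative NumPy re-implementation of the idea, not the paper's fused SDDMM CUDA kernel or its compressed-sparse encoding; the function name, tensor shapes, and group handling are assumptions made for the example.

```python
# Minimal sketch of N:M (here 2:4) dynamic pruning of attention scores.
# Not the DFSS fused GPU kernel; a plain NumPy approximation of the idea.
import numpy as np

def dfss_style_attention(Q, K, V, n=2, m=4):
    """Dense Q@K^T, N:M pruning per row, masked softmax, then @V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # full attention scores, (L, L)
    L = scores.shape[-1]
    assert L % m == 0, "sequence length must be a multiple of M in this sketch"

    # View each row as groups of M entries and keep the N largest per group.
    groups = scores.reshape(scores.shape[0], L // m, m)
    topn = np.argsort(groups, axis=-1)[..., -n:]          # indices of top-N per group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, topn, True, axis=-1)
    mask = mask.reshape(scores.shape)

    # Masked softmax: pruned entries contribute nothing.
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Tiny usage example with random data (hypothetical shapes).
rng = np.random.default_rng(0)
L, d = 8, 16
Q, K, V = rng.standard_normal((3, L, d))
out = dfss_style_attention(Q, K, V)
print(out.shape)  # (8, 16)
```

Because every group of M scores keeps exactly its N largest entries, each row retains its dominant attention weights, which is the "fine-grained" property the paper relies on; the actual DFSS kernel performs the pruning and sparse encoding inside the SDDMM pass rather than materializing the mask as done here.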