SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention

Ahan Gupta, Yueming Yuan, Devansh Jain, Yuhao Ge, David Aponte, Yanqi Zhou, Charith Mendis
{"title":"SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention","authors":"Ahan Gupta, Yueming Yuan, Devansh Jain, Yuhao Ge, David Aponte, Yanqi Zhou, Charith Mendis","doi":"arxiv-2407.16847","DOIUrl":null,"url":null,"abstract":"Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA)\nperformance across natural language processing and vision tasks. However, their\nquadratic dependence on sequence lengths has bottlenecked inference speeds. To\ncircumvent this bottleneck, researchers have proposed various sparse-MHSA\nmodels, where a subset of full attention is computed. Despite their promise,\ncurrent sparse libraries and compilers do not support high-performance\nimplementations for diverse sparse-MHSA patterns due to the underlying sparse\nformats they operate on. These formats, which are typically designed for\nhigh-performance & scientific computing applications, are either curated for\nextreme amounts of random sparsity (<1% non-zero values), or specific sparsity\npatterns. However, the sparsity patterns in sparse-MHSA are moderately sparse\n(10-50% non-zero values) and varied, resulting in existing sparse-formats\ntrading off generality for performance. We bridge this gap, achieving both generality and performance, by proposing a\nnovel sparse format: affine-compressed-sparse-row (ACSR) and supporting\ncode-generation scheme, SPLAT, that generates high-performance implementations\nfor diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code\ngeneration algorithm is the observation that common sparse-MHSA patterns have\nuniquely regular geometric properties. These properties, which can be analyzed\njust-in-time, expose novel optimizations and tiling strategies that SPLAT\nexploits to generate high-performance implementations for diverse patterns. To\ndemonstrate SPLAT's efficacy, we use it to generate code for various\nsparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over\nhand-written kernels written in triton and TVM respectively on A100 GPUs.\nMoreover, its interfaces are intuitive and easy to use with existing\nimplementations of MHSA in JAX.","PeriodicalId":501197,"journal":{"name":"arXiv - CS - Programming Languages","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Programming Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.16847","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence lengths has bottlenecked inference speeds. To circumvent this bottleneck, researchers have proposed various sparse-MHSA models, where only a subset of full attention is computed. Despite their promise, current sparse libraries and compilers do not support high-performance implementations for diverse sparse-MHSA patterns due to the underlying sparse formats they operate on. These formats, which are typically designed for high-performance and scientific computing applications, are either curated for extreme amounts of random sparsity (<1% non-zero values) or for specific sparsity patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse (10-50% non-zero values) and varied, so existing sparse formats trade off generality for performance. We bridge this gap, achieving both generality and performance, by proposing a novel sparse format, affine-compressed-sparse-row (ACSR), and a supporting code-generation scheme, SPLAT, which generates high-performance implementations for diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code-generation algorithm is the observation that common sparse-MHSA patterns have uniquely regular geometric properties. These properties, which can be analyzed just-in-time, expose novel optimizations and tiling strategies that SPLAT exploits to generate high-performance implementations for diverse patterns. To demonstrate SPLAT's efficacy, we use it to generate code for various sparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over hand-written kernels written in Triton and TVM, respectively, on A100 GPUs. Moreover, its interfaces are intuitive and easy to use with existing implementations of MHSA in JAX.
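To make the "regular geometric properties" observation concrete, the JAX sketch below builds a sliding-window attention mask and shows that its non-zero columns in each query row can be described by a per-row affine rule (a start column plus a contiguous count), rather than the explicit per-nonzero column indices a CSR format would store. This is only an illustration of the idea the abstract describes, not the paper's actual ACSR layout or SPLAT API; the function names (`sliding_window_mask`, `affine_row_params`, `masked_attention`) and the (start, count) encoding are hypothetical.

```python
# Illustrative sketch only: a sliding-window sparse-MHSA pattern and a per-row
# affine description of it. Not the paper's ACSR definition or SPLAT interface.
import jax
import jax.numpy as jnp


def sliding_window_mask(seq_len: int, window: int) -> jnp.ndarray:
    """Dense 0/1 mask where query i attends to keys j with |i - j| <= window."""
    i = jnp.arange(seq_len)[:, None]
    j = jnp.arange(seq_len)[None, :]
    return (jnp.abs(i - j) <= window).astype(jnp.float32)


def affine_row_params(seq_len: int, window: int):
    """Per-row affine description of the same pattern (hypothetical encoding).

    Row i's non-zero columns are start[i], start[i] + 1, ..., start[i] + count[i] - 1,
    so two small vectors replace an explicit column-index array.
    """
    i = jnp.arange(seq_len)
    start = jnp.maximum(i - window, 0)
    end = jnp.minimum(i + window, seq_len - 1)
    return start, end - start + 1


def masked_attention(q, k, v, mask):
    """Reference dense attention with an additive -inf mask (for checking only)."""
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    scores = jnp.where(mask > 0, scores, -jnp.inf)
    return jax.nn.softmax(scores, axis=-1) @ v


if __name__ == "__main__":
    seq_len, window, d = 8, 2, 4
    q, k, v = jax.random.normal(jax.random.PRNGKey(0), (3, seq_len, d))

    mask = sliding_window_mask(seq_len, window)
    start, count = affine_row_params(seq_len, window)

    # The affine (start, count) parameters reproduce the dense mask exactly.
    cols = jnp.arange(seq_len)[None, :]
    rebuilt = (cols >= start[:, None]) & (cols < (start + count)[:, None])
    assert bool(jnp.all(rebuilt.astype(jnp.float32) == mask))

    out = masked_attention(q, k, v, mask)
    print(out.shape)  # (8, 4)
```

Because the mask is fully determined by a handful of affine parameters per row, a code generator can analyze the pattern just-in-time and pick tiling strategies accordingly, which is the kind of opportunity the abstract says SPLAT exploits.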