{"title":"SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention","authors":"Ahan Gupta, Yueming Yuan, Devansh Jain, Yuhao Ge, David Aponte, Yanqi Zhou, Charith Mendis","doi":"arxiv-2407.16847","DOIUrl":null,"url":null,"abstract":"Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA)\nperformance across natural language processing and vision tasks. However, their\nquadratic dependence on sequence lengths has bottlenecked inference speeds. To\ncircumvent this bottleneck, researchers have proposed various sparse-MHSA\nmodels, where a subset of full attention is computed. Despite their promise,\ncurrent sparse libraries and compilers do not support high-performance\nimplementations for diverse sparse-MHSA patterns due to the underlying sparse\nformats they operate on. These formats, which are typically designed for\nhigh-performance & scientific computing applications, are either curated for\nextreme amounts of random sparsity (<1% non-zero values), or specific sparsity\npatterns. However, the sparsity patterns in sparse-MHSA are moderately sparse\n(10-50% non-zero values) and varied, resulting in existing sparse-formats\ntrading off generality for performance. We bridge this gap, achieving both generality and performance, by proposing a\nnovel sparse format: affine-compressed-sparse-row (ACSR) and supporting\ncode-generation scheme, SPLAT, that generates high-performance implementations\nfor diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code\ngeneration algorithm is the observation that common sparse-MHSA patterns have\nuniquely regular geometric properties. These properties, which can be analyzed\njust-in-time, expose novel optimizations and tiling strategies that SPLAT\nexploits to generate high-performance implementations for diverse patterns. To\ndemonstrate SPLAT's efficacy, we use it to generate code for various\nsparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over\nhand-written kernels written in triton and TVM respectively on A100 GPUs.\nMoreover, its interfaces are intuitive and easy to use with existing\nimplementations of MHSA in JAX.","PeriodicalId":501197,"journal":{"name":"arXiv - CS - Programming Languages","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Programming Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.16847","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Multi-head self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence length bottlenecks inference speed. To circumvent this bottleneck, researchers have proposed various sparse-MHSA models in which only a subset of full attention is computed. Despite their promise, current sparse libraries and compilers do not support high-performance implementations for diverse sparse-MHSA patterns because of the underlying sparse formats they operate on. These formats, typically designed for high-performance and scientific computing applications, are curated either for extremely high random sparsity (<1% non-zero values) or for specific sparsity patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse (10-50% non-zero values) and varied, so existing sparse formats trade off generality for performance.

We bridge this gap, achieving both generality and performance, by proposing a novel sparse format, affine-compressed-sparse-row (ACSR), and a supporting code-generation scheme, SPLAT, which generates high-performance implementations for diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code-generation algorithm is the observation that common sparse-MHSA patterns have uniquely regular geometric properties. These properties, which can be analyzed just-in-time, expose novel optimizations and tiling strategies that SPLAT exploits to generate high-performance implementations for diverse patterns. To demonstrate SPLAT's efficacy, we use it to generate code for various sparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over hand-written kernels written in Triton and TVM, respectively, on A100 GPUs. Moreover, SPLAT's interfaces are intuitive and easy to use with existing implementations of MHSA in JAX.
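
The abstract does not show SPLAT's interface, so the sketch below is only a point of reference: masked MHSA written in plain JAX with a sliding-window (banded) pattern, illustrating both the moderate density (10-50% non-zero) and the regular geometry the abstract describes. The function names, shapes, and window size are illustrative assumptions, not part of SPLAT.

```python
# A minimal sketch, not SPLAT's actual API (the abstract does not show it):
# masked MHSA in plain JAX with a sliding-window (banded) pattern, the kind of
# moderately sparse, geometrically regular sparse-MHSA pattern the paper targets.
# A code generator such as SPLAT would replace the masked dense matmuls below
# with kernels specialized to the pattern's geometry.
import jax
import jax.numpy as jnp


def sliding_window_mask(seq_len: int, window: int) -> jnp.ndarray:
    """Boolean [seq_len, seq_len] mask: query i attends to keys j with |i - j| <= window.

    Note the regular geometry: row i's non-zero columns span [i - window, i + window],
    an affine function of i. Regularities like this are presumably what a format such
    as ACSR exploits instead of storing per-element indices.
    """
    idx = jnp.arange(seq_len)
    return jnp.abs(idx[:, None] - idx[None, :]) <= window


def masked_mhsa(q, k, v, mask):
    """q, k, v: [heads, seq_len, head_dim]; mask: [seq_len, seq_len] boolean."""
    scale = 1.0 / jnp.sqrt(q.shape[-1])
    scores = jnp.einsum("hqd,hkd->hqk", q, k) * scale
    scores = jnp.where(mask[None, :, :], scores, -jnp.inf)  # drop masked positions
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("hqk,hkd->hqd", weights, v)


# Example: a window of 128 on a 1024-token sequence keeps roughly 25% of the score
# matrix, squarely in the 10-50% density regime described in the abstract.
heads, seq_len, head_dim = 8, 1024, 64
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (heads, seq_len, head_dim))
k = jax.random.normal(kk, (heads, seq_len, head_dim))
v = jax.random.normal(kv, (heads, seq_len, head_dim))
out = masked_mhsa(q, k, v, sliding_window_mask(seq_len, window=128))
print(out.shape)  # (8, 1024, 64)
```

Because this plain-JAX version still materializes the full score matrix before masking, it gains no speed from the sparsity; the paper's reported speedups come from generating kernels that compute only the non-zero region.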