Attention in SRAM on Tenstorrent Grayskull

Moritz Thüning
{"title":"Attention in SRAM on Tenstorrent Grayskull","authors":"Moritz Thüning","doi":"arxiv-2407.13885","DOIUrl":null,"url":null,"abstract":"When implementations of the Transformer's self-attention layer utilize SRAM\ninstead of DRAM, they can achieve significant speedups. The Tenstorrent\nGrayskull architecture provides a large SRAM, distributed across a grid of\ncores. This work presents a fused kernel for Grayskull, that exclusively\nutilizes its large SRAM by combining matrix multiplication, attention score\nscaling and Softmax operations. Additionally, a dedicated Softmax kernel\nutilizing the SRAM and a CPU implementation serving as a baseline are\npresented. The Softmax operation consumes most of the runtime in the\ncomputation of attention weights from queries and keys on Grayskull. The\nspeedup of the dedicated Softmax kernel compared to the CPU implementation is\nup to $10 \\times$, and the Softmax implementation inside the fused kernel is\napproximately $1.8 \\times$ faster than the dedicated Softmax kernel. The time\nand memory complexity of all implementations is quadratic in sequence length.\nCurrently, the Grayskull e150 is approximately $30 \\times$ cheaper for the\ngeneral public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers\napproximately $1.5 \\times$ more SRAM.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.13885","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

When implementations of the Transformer's self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively utilizes its large SRAM by combining matrix multiplication, attention score scaling and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The speedup of the dedicated Softmax kernel compared to the CPU implementation is up to $10 \times$, and the Softmax implementation inside the fused kernel is approximately $1.8 \times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. Currently, the Grayskull e150 is approximately $30 \times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5 \times$ more SRAM.
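The computation the abstract describes, attention weights from queries and keys, is $\mathrm{softmax}(QK^\top / \sqrt{d})$: a matrix multiplication, a scaling of the attention scores, and a row-wise Softmax. The sketch below is a plain NumPy reference of that computation (comparable in role to the paper's CPU baseline), not the Grayskull fused kernel itself; the function name and shapes are illustrative assumptions.

```python
# Reference sketch (not the Grayskull kernel): attention weights
# softmax(Q K^T / sqrt(d)) from queries Q and keys K.
import numpy as np

def attention_weights(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Q, K: (seq_len, d). Returns (seq_len, seq_len) attention weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # matmul + attention score scaling
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable Softmax
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

# Example: the (seq_len x seq_len) score matrix makes both time and
# memory quadratic in sequence length, matching the abstract's complexity claim.
Q = np.random.randn(128, 64).astype(np.float32)
K = np.random.randn(128, 64).astype(np.float32)
W = attention_weights(Q, K)   # shape (128, 128); each row sums to 1
```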