Attention in SRAM on Tenstorrent Grayskull

Moritz Thüning
{"title":"Attention in SRAM on Tenstorrent Grayskull","authors":"Moritz Thüning","doi":"arxiv-2407.13885","DOIUrl":null,"url":null,"abstract":"When implementations of the Transformer's self-attention layer utilize SRAM\ninstead of DRAM, they can achieve significant speedups. The Tenstorrent\nGrayskull architecture provides a large SRAM, distributed across a grid of\ncores. This work presents a fused kernel for Grayskull, that exclusively\nutilizes its large SRAM by combining matrix multiplication, attention score\nscaling and Softmax operations. Additionally, a dedicated Softmax kernel\nutilizing the SRAM and a CPU implementation serving as a baseline are\npresented. The Softmax operation consumes most of the runtime in the\ncomputation of attention weights from queries and keys on Grayskull. The\nspeedup of the dedicated Softmax kernel compared to the CPU implementation is\nup to $10 \\times$, and the Softmax implementation inside the fused kernel is\napproximately $1.8 \\times$ faster than the dedicated Softmax kernel. The time\nand memory complexity of all implementations is quadratic in sequence length.\nCurrently, the Grayskull e150 is approximately $30 \\times$ cheaper for the\ngeneral public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers\napproximately $1.5 \\times$ more SRAM.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.13885","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

When implementations of the Transformer's self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively utilizes its large SRAM by combining matrix multiplication, attention score scaling and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The speedup of the dedicated Softmax kernel compared to the CPU implementation is up to $10 \times$, and the Softmax implementation inside the fused kernel is approximately $1.8 \times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. Currently, the Grayskull e150 is approximately $30 \times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5 \times$ more SRAM.
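The computation the abstract describes, attention weights from queries and keys, is $\mathrm{softmax}(QK^\top / \sqrt{d})$: a matrix multiplication, a scaling of the attention scores, and a row-wise Softmax. The sketch below is a plain NumPy reference of that computation (comparable in role to the paper's CPU baseline), not the Grayskull fused kernel itself; the function name and shapes are illustrative assumptions.

```python
# Reference sketch (not the Grayskull kernel): attention weights
# softmax(Q K^T / sqrt(d)) from queries Q and keys K.
import numpy as np

def attention_weights(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Q, K: (seq_len, d). Returns (seq_len, seq_len) attention weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # matmul + attention score scaling
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable Softmax
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

# Example: the (seq_len x seq_len) score matrix makes both time and
# memory quadratic in sequence length, matching the abstract's complexity claim.
Q = np.random.randn(128, 64).astype(np.float32)
K = np.random.randn(128, 64).astype(np.float32)
W = attention_weights(Q, K)   # shape (128, 128); each row sums to 1
```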