Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library

Hiroyuki Ootomo, Rio Yokota
{"title":"Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library","authors":"Hiroyuki Ootomo, Rio Yokota","doi":"10.1145/3578178.3578238","DOIUrl":null,"url":null,"abstract":"Matrix-matrix multiplication is used for various linear algebra algorithms such as matrix decomposition and tensor contraction. NVIDIA Tensor Core is a mixed-precision matrix-matrix multiplication and addition computing unit, where the theoretical peak performance is more than 300 TFlop/s on NVIDIA A100 GPU. NVIDIA provides WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Core is to supply the input matrices from shared memory, which has higher bandwidth than global memory. However, the Bytes-per-Flops (B/F) ratio of the shared memory and Tensor Cores is small since the performance of Tensor Cores is high. Thus, it is important to reduce the shared memory footprint for efficient Tensor Cores usage. In this paper, we analyze the simple matrix-matrix multiplication on Tensor Cores by the roofline model and figure out that the bandwidth of shared memory might be a limitation of the performance when using WMMA API. To alleviate this issue, we provide a WMMA API extension library to boost the throughput of the computation, which has two components. The first one allows for manipulating the array of registers input to Tensor Cores flexibly. We evaluate the performance improvement of this library. The outcome of our evaluation shows that our library reduces the shared memory footprint and speeds up the computation using Tensor Cores. The second one is an API for the SGEMM emulation on Tensor Cores without additional shared memory usage. We have demonstrated that the single-precision emulating batch SGEMM implementation on Tensor Cores using this library achieves 54.2 TFlop/s on A100 GPU, which outperforms the theoretical peak performance of FP32 SIMT Cores while achieving the same level of accuracy as cuBLAS. The achieved throughput can not be achieved without reducing the shared memory footprint done by our library with the same amount of register usage.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3578178.3578238","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Matrix-matrix multiplication is used in various linear algebra algorithms, such as matrix decomposition and tensor contraction. The NVIDIA Tensor Core is a mixed-precision matrix-matrix multiply-and-add computing unit whose theoretical peak performance exceeds 300 TFlop/s on the NVIDIA A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Cores is to supply the input matrices from shared memory, which has higher bandwidth than global memory. However, because Tensor Core performance is so high, the Bytes-per-Flop (B/F) ratio available between shared memory and Tensor Cores is small. It is therefore important to reduce the shared memory footprint to use Tensor Cores efficiently. In this paper, we analyze simple matrix-matrix multiplication on Tensor Cores using the roofline model and show that shared memory bandwidth can limit performance when using the WMMA API. To alleviate this issue, we provide a WMMA API extension library with two components that boost computation throughput. The first allows flexible manipulation of the register arrays that are input to Tensor Cores. Our evaluation shows that this component reduces the shared memory footprint and speeds up computation on Tensor Cores. The second is an API for SGEMM emulation on Tensor Cores without additional shared memory usage. We demonstrate that a single-precision batched SGEMM emulation implemented on Tensor Cores with this library achieves 54.2 TFlop/s on an A100 GPU, outperforming the theoretical peak performance of the FP32 SIMT cores while achieving the same level of accuracy as cuBLAS. This throughput cannot be reached without the shared memory footprint reduction our library performs with the same amount of register usage.
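
As a concrete picture of the access pattern the abstract describes, below is a minimal sketch (not the paper's code) of the common WMMA usage: input tiles are staged in shared memory and then loaded into Tensor Core fragments. The kernel name and the fixed 16x16x16 shape are illustrative.

```cuda
// Minimal sketch of the standard WMMA pattern described in the abstract:
// tiles staged in shared memory, then loaded into Tensor Core fragments.
// Kernel name and fixed tile shape are illustrative, not the paper's code.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
    __shared__ half a_smem[16 * 16];
    __shared__ half b_smem[16 * 16];

    // One warp cooperatively stages both tiles in shared memory.
    for (unsigned i = threadIdx.x; i < 16 * 16; i += warpSize) {
        a_smem[i] = a[i];
        b_smem[i] = b[i];
    }
    __syncwarp();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Each load_matrix_sync reads a full 16x16 tile out of shared memory;
    // this traffic is what the paper's roofline analysis identifies as the
    // potential bottleneck.
    wmma::load_matrix_sync(a_frag, a_smem, 16);
    wmma::load_matrix_sync(b_frag, b_smem, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_col_major);
}
```

Launched with a single warp (32 threads), this computes one 16x16 product; a full GEMM tiles this pattern repeatedly, which is where the shared memory traffic accumulates.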
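To make the roofline argument concrete, here is rough back-of-the-envelope arithmetic for the bandwidth bound. The hardware figures are assumptions drawn from public A100 specifications, not numbers taken from the paper:

```latex
% Assumed A100 figures: aggregate shared memory bandwidth ~19.5 TB/s,
% FP16 Tensor Core peak ~312 TFlop/s.
\[
  \mathrm{B/F_{machine}} \approx \frac{19.5\ \mathrm{TB/s}}{312\ \mathrm{TFlop/s}}
  \approx 0.06\ \mathrm{bytes/flop}
\]
% A single m16n16k16 WMMA step loads two 16x16 FP16 tiles (1024 bytes)
% from shared memory for 2 * 16^3 = 8192 flops:
\[
  \mathrm{B/F_{kernel}} = \frac{1024\ \mathrm{B}}{8192\ \mathrm{flop}}
  = 0.125\ \mathrm{bytes/flop}
\]
```

Since the kernel demands more bytes per flop than the machine can supply, a kernel that reloads tiles this way is bound by shared memory bandwidth rather than Tensor Core throughput, which is exactly the pressure the footprint reduction relieves.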
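The first library component concerns manipulating the register arrays (fragments) that feed Tensor Cores. The sketch below uses only the standard, documented WMMA element access (the x array and num_elements members) to hint at what register-level manipulation looks like; the extension library's actual API is more flexible and is not shown here. The function name is hypothetical.

```cuda
// Sketch of register-level fragment manipulation using documented WMMA
// element access; no shared memory round trip is needed. This is not the
// extension library's API; scale_accumulator is a hypothetical name.
#include <mma.h>
using namespace nvcuda;

__device__ void scale_accumulator(
        wmma::fragment<wmma::accumulator, 16, 16, 16, float>& frag,
        float alpha) {
    // Each thread of the warp scales the fragment elements it privately
    // holds in registers.
    for (unsigned i = 0; i < frag.num_elements; i++) {
        frag.x[i] *= alpha;
    }
}
```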
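For the second component, the general idea behind emulating single precision on FP16 Tensor Cores is to split each FP32 value into an FP16 value plus an FP16 residual and correct the product with additional Tensor Core operations. The sketch below shows only the basic split, omitting the scaling and accumulation details the full emulation needs; split_fp32 is a hypothetical helper, not the library's API.

```cuda
// Basic idea behind FP32 emulation on FP16 Tensor Cores: split each FP32
// value into a leading FP16 part and an FP16 residual. A sketch of the
// general technique, not the library's implementation.
#include <cuda_fp16.h>

__device__ void split_fp32(float v, half& hi, half& lo) {
    hi = __float2half(v);                     // leading bits of v
    lo = __float2half(v - __half2float(hi));  // residual lost in the cast
}

// A*B is then assembled from Tensor Core products of the parts, e.g.
// hi_a*hi_b + hi_a*lo_b + lo_a*hi_b, accumulated in FP32.
```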