Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

IF 2.5 3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications Pub Date : 2022-03-07 DOI:10.1177/10943420221090256

Hiroyuki Ootomo, Rio Yokota

{"title":"Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance","authors":"Hiroyuki Ootomo, Rio Yokota","doi":"10.1177/10943420221090256","DOIUrl":null,"url":null,"abstract":"Tensor Core is a mixed-precision matrix–matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand of dense matrix multiplication from machine learning. However, many applications in scientific computing such as preconditioners for iterative solvers and low-precision Fourier transforms can exploit these Tensor Cores. To compute a matrix multiplication on Tensor Cores, we need to convert input matrices to half-precision, which results in loss of accuracy. To avoid this, we can keep the mantissa loss in the conversion using additional half-precision variables and use them for correcting the accuracy of matrix–matrix multiplication. Even with this correction, the use of Tensor Cores yields higher throughput compared to FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high accuracy, high performance, and low power consumption matrix–matrix multiplication implementation using Tensor Cores, which exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA’s CUTLASS. We found that the key to achieving this accuracy is how to deal with the rounding inside Tensor Cores and underflow probability during the correction computation. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for full exponent range of FP32 using TF32 Tensor Cores on NVIDIA A100 GPUs, which outperforms the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"475 - 491"},"PeriodicalIF":2.5000,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of High Performance Computing Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1177/10943420221090256","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 15

Abstract

Tensor Core is a mixed-precision matrix–matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand of dense matrix multiplication from machine learning. However, many applications in scientific computing such as preconditioners for iterative solvers and low-precision Fourier transforms can exploit these Tensor Cores. To compute a matrix multiplication on Tensor Cores, we need to convert input matrices to half-precision, which results in loss of accuracy. To avoid this, we can keep the mantissa loss in the conversion using additional half-precision variables and use them for correcting the accuracy of matrix–matrix multiplication. Even with this correction, the use of Tensor Cores yields higher throughput compared to FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high accuracy, high performance, and low power consumption matrix–matrix multiplication implementation using Tensor Cores, which exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA’s CUTLASS. We found that the key to achieving this accuracy is how to deal with the rounding inside Tensor Cores and underflow probability during the correction computation. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for full exponent range of FP32 using TF32 Tensor Cores on NVIDIA A100 GPUs, which outperforms the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.

查看原文本刊更多论文

在超越FP32理论峰值性能的同时，从张量核心恢复单精度精度

Tensor Core是NVIDIA GPU上的混合精度矩阵-矩阵乘法单元，在Ampere架构上的理论峰值性能超过300 TFlop/s。张量核是为了响应机器学习对密集矩阵乘法的高需求而开发的。然而，科学计算中的许多应用，如迭代求解器的预处理器和低精度傅立叶变换，都可以利用这些张量核。要在张量核上计算矩阵乘法，我们需要将输入矩阵转换为半精度，这会导致精度损失。为了避免这种情况，我们可以使用额外的半精度变量来保持转换中的尾数损失，并使用它们来校正矩阵-矩阵乘法的精度。即使进行了这种校正，与FP32 SIMT核心相比，张量核心的使用也会产生更高的吞吐量。然而，仅此方法的校正能力是有限的，并且由此产生的精度不能与FP32 SIMT核心上的矩阵乘法的精度相匹配。我们解决了这个问题，并开发了一种使用张量核心的高精度、高性能和低功耗矩阵-矩阵乘法实现，它与FP32 SIMT核心的精度完全匹配，同时实现了卓越的吞吐量。该实现基于NVIDIA的CUTRASS。我们发现，实现这种精度的关键是如何在校正计算过程中处理张量核内的舍入和下溢概率。我们的实现在NVIDIA A100 GPU上使用FP16张量核在有限指数范围内实现了51TFlop/s，使用TF32张量核在FP32的全指数范围内达到了33TFlop/s，这优于理论上的FP32 SIMT核19.5TFlop/s。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

International Journal of High Performance Computing Applications 工程技术-计算机：跨学科应用

CiteScore

6.10

自引率

6.50%

发文量

审稿时长

>12 weeks

期刊介绍： With ever increasing pressure for health services in all countries to meet rising demands, improve their quality and efficiency, and to be more accountable; the need for rigorous research and policy analysis has never been greater. The Journal of Health Services Research & Policy presents the latest scientific research, insightful overviews and reflections on underlying issues, and innovative, thought provoking contributions from leading academics and policy-makers. It provides ideas and hope for solving dilemmas that confront all countries.