Accelerating Sparse Deep Neural Network Inference Using GPU Tensor Cores

Yufei Sun, Long Zheng, Qinggang Wang, Xiangyu Ye, Yu Huang, Pengcheng Yao, Xiaofei Liao, Hai Jin
{"title":"Accelerating Sparse Deep Neural Network Inference Using GPU Tensor Cores","authors":"Yufei Sun, Long Zheng, Qinggang Wang, Xiangyu Ye, Yu Huang, Pengcheng Yao, Xiaofei Liao, Hai Jin","doi":"10.1109/HPEC55821.2022.9926300","DOIUrl":null,"url":null,"abstract":"Sparse deep neural networks (SpDNN) attract a lot of research and industry attention because of their powerful learning capability, whose execution time is dominated by the sparse matrix-dense matrix multiplication (SpMM). As one of specialized processors for matrix multiplication, NVIDIA GPU Tensor Cores can perform half-precision matrix-matrix multiplication with higher performance than CUDA Cores, which provides great op-portunities for SpMM acceleration. However, performing SpMM efficiently on Tensor Cores remains tremendously challenging. First, typical Tensor Cores do not handle extremely sparse matrix computations well, delivering much lower performance compared to the dense counterparts. Second, the single-precision Challenge dataset prevents them from leveraging powerful Tensor Cores to improve performance. To this end, we first propose a similarity-based matrix transformation scheme, which polarizes the weight matrix to be either denser or sparser in local regions. Then the denser and sparser workloads are respectively processed on Tensor Cores and CUDA Cores, boosting the overall efficiency. Second, considering the half-precision limitation of Tensor Cores, we further propose a lightweight emulation algorithm to achieve the single-precision computation on Tensor Cores without affecting the correctness of final results. To the best of our knowl-edge, this paper is the first to accelerate SpDNN inference on Tensor Cores without compromising the precision requirement. Extensive experiments validate that our work reaches up to 300 TeraEdges per second inference throughput on a single A100 GPU, yielding up to 89.41x and 8.12x speedups against the champions of the 2020 and 2021 Sparse Deep Neural Network Graph Challenge, respectively. Moreover, our 4-GPU version are also up to 6.56 x faster over the 2021 champion running on 4 GPUs and 7.55x faster over the 2020 champion running on 768 GPUs.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC55821.2022.9926300","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Sparse deep neural networks (SpDNNs) attract substantial research and industry attention because of their powerful learning capability, and their execution time is dominated by sparse matrix-dense matrix multiplication (SpMM). As specialized matrix-multiplication units, NVIDIA GPU Tensor Cores can perform half-precision matrix-matrix multiplication at higher performance than CUDA Cores, which offers great opportunities for SpMM acceleration. However, performing SpMM efficiently on Tensor Cores remains tremendously challenging. First, typical Tensor Cores do not handle extremely sparse matrix computations well, delivering much lower performance than on dense workloads. Second, the single-precision requirement of the Challenge dataset prevents existing solutions from leveraging the powerful Tensor Cores to improve performance. To this end, we first propose a similarity-based matrix transformation scheme, which polarizes the weight matrix so that local regions become either denser or sparser. The denser and sparser workloads are then processed on Tensor Cores and CUDA Cores, respectively, boosting overall efficiency. Second, considering the half-precision limitation of Tensor Cores, we further propose a lightweight emulation algorithm that achieves single-precision computation on Tensor Cores without affecting the correctness of the final results. To the best of our knowledge, this paper is the first to accelerate SpDNN inference on Tensor Cores without compromising the precision requirement. Extensive experiments validate that our work reaches an inference throughput of up to 300 TeraEdges per second on a single A100 GPU, yielding speedups of up to 89.41x and 8.12x over the champions of the 2020 and 2021 Sparse Deep Neural Network Graph Challenge, respectively. Moreover, our 4-GPU version is up to 6.56x faster than the 2021 champion running on 4 GPUs and 7.55x faster than the 2020 champion running on 768 GPUs.
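The abstract does not spell out the paper's "lightweight emulation algorithm," but single-precision emulation on half-precision Tensor Cores is typically built on the well-known hi/lo operand split: each fp32 value x is decomposed into x_hi = fp16(x) and x_lo = fp16(x - x_hi), and the product is assembled from three half-precision matrix multiplications accumulated in fp32. The sketch below is our own minimal illustration of that technique, not the authors' implementation; the kernel and helper names (split_fp32, wmma_emulated_sgemm) are hypothetical.

```cuda
// split_wmma.cu -- minimal sketch of emulating fp32 GEMM on fp16 Tensor Cores
// via the hi/lo operand split. Illustration only, not the paper's algorithm.
// Build (Volta or newer): nvcc -arch=sm_70 split_wmma.cu
#include <cuda_fp16.h>
#include <mma.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>

using namespace nvcuda;

constexpr int M = 16, N = 16, K = 16;   // one WMMA (Tensor Core) tile

// Split each fp32 value x into x_hi = fp16(x) and x_lo = fp16(x - x_hi),
// so that x_hi + x_lo recovers x to roughly fp32 accuracy.
__global__ void split_fp32(const float* x, __half* hi, __half* lo, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        __half h = __float2half(x[i]);
        hi[i] = h;
        lo[i] = __float2half(x[i] - __half2float(h));
    }
}

// C = A_hi*B_hi + A_hi*B_lo + A_lo*B_hi, accumulated in fp32 on Tensor Cores.
// The A_lo*B_lo term falls below fp32 rounding for these inputs and is dropped.
__global__ void wmma_emulated_sgemm(const __half* a_hi, const __half* a_lo,
                                    const __half* b_hi, const __half* b_lo,
                                    float* c) {
    wmma::fragment<wmma::matrix_a, M, N, K, __half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, M, N, K, __half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, M, N, K, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    wmma::load_matrix_sync(a, a_hi, K);
    wmma::load_matrix_sync(b, b_hi, N);
    wmma::mma_sync(acc, a, b, acc);                  // A_hi * B_hi
    wmma::load_matrix_sync(b, b_lo, N);
    wmma::mma_sync(acc, a, b, acc);                  // A_hi * B_lo
    wmma::load_matrix_sync(a, a_lo, K);
    wmma::load_matrix_sync(b, b_hi, N);
    wmma::mma_sync(acc, a, b, acc);                  // A_lo * B_hi
    wmma::store_matrix_sync(c, acc, N, wmma::mem_row_major);
}

int main() {
    const int n = M * K;                 // all tiles here are 16x16
    float *A, *B, *C;
    __half *Ah, *Al, *Bh, *Bl;
    cudaMallocManaged(&A, n * sizeof(float));
    cudaMallocManaged(&B, n * sizeof(float));
    cudaMallocManaged(&C, n * sizeof(float));
    cudaMallocManaged(&Ah, n * sizeof(__half));
    cudaMallocManaged(&Al, n * sizeof(__half));
    cudaMallocManaged(&Bh, n * sizeof(__half));
    cudaMallocManaged(&Bl, n * sizeof(__half));
    for (int i = 0; i < n; ++i) {
        A[i] = (float)rand() / RAND_MAX;
        B[i] = (float)rand() / RAND_MAX;
    }
    split_fp32<<<1, n>>>(A, Ah, Al, n);
    split_fp32<<<1, n>>>(B, Bh, Bl, n);
    wmma_emulated_sgemm<<<1, 32>>>(Ah, Al, Bh, Bl, C);   // one warp per tile
    cudaDeviceSynchronize();

    // Maximum deviation from a plain fp32 CPU reference.
    double max_err = 0.0;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += A[i * K + k] * B[k * N + j];
            max_err = fmax(max_err, (double)fabs(ref - C[i * N + j]));
        }
    printf("max |error| vs fp32 reference: %g\n", max_err);
    return 0;
}
```

The intuition for why three half-precision products suffice: each fp16 value carries about 11 significand bits, so the hi + lo pair covers roughly 22 of the 24 significand bits of an fp32 value, and accumulating the cross terms in fp32 keeps the result close to a native single-precision GEMM.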