Towards Fast GPU-based Sparse DNN Inference: A Hybrid Compute Model

Shaoxian Xu, Minkang Wu, Long Zheng, Zhiyuan Shao, Xiangyu Ye, Xiaofei Liao, Hai Jin
{"title":"Towards Fast GPU-based Sparse DNN Inference: A Hybrid Compute Model","authors":"Shaoxian Xu, Minkang Wu, Long Zheng, Zhiyuan Shao, Xiangyu Ye, Xiaofei Liao, Hai Jin","doi":"10.1109/HPEC55821.2022.9926290","DOIUrl":null,"url":null,"abstract":"As the model scale of Deep Neural Networks (DNNs) increases, the memory and computational cost of DNNs become overwhelmingly large. Sparse Deep Neural Networks (SpDNNs) are promising to cope with this challenge by using fewer weights while preserving the accuracy. However, the sparsity nature of SpDNN models makes it difficult to run efficiently on GPUs. To stimulate technical advances for improving the efficiency of SpDNN inference, the MIT/IEEE/Amazon GraphChallenge proposes the SpDNN Challenge in 2019. In this paper, we present a hybrid compute model to improve the efficiency of Sparse Matrix Multiplications (SpMMs), the core computation of SpDNN inference. First, the given sparse weight matrix will be divided to generate many (sparse and dense) submatrices. For sparse submatrices, we leverage compile-time data embedding to compile the sparse data together with their corresponding computations into instructions and hence the number of random accesses can be reduced significantly. For dense submatrices, we follow the traditional computing mode where the data is obtained from the memory to exploit the high memory bandwidth of GPU. This hybrid compute model effectively balances the memory and instruction bottlenecks, and offers more scheduling opportunities to overlap computing operations and memory accesses on GPU. To determine whether a sub matrix is sparse, we present a cost model to estimate its time cost under the traditional computing mode and the data-embedded computing mode in an accurate and efficient manner. Once the computing mode for all submatrices is determined, customized codes will be generated for the SpDNN inference. Experimental results on the SpDNN Challenge benchmarks show that our approach achieves up to 197.86 tera-edges per second inference throughput on a single NVIDIA A100 GPU. Compared to the 2021 and 2020 champions, our approach offers up to 6.37x and 89.94x speedups on a single GPU, respectively. We also implement a 16-GPU version, showing up to 9.49x and 80.11x speedups over the former 16-GPU baselines of the 2021 and 2020 champions.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC55821.2022.9926290","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

As the scale of Deep Neural Networks (DNNs) grows, their memory and computational costs become overwhelmingly large. Sparse Deep Neural Networks (SpDNNs) promise to cope with this challenge by using far fewer weights while preserving accuracy. However, the sparse, irregular nature of SpDNN models makes them difficult to run efficiently on GPUs. To stimulate technical advances in SpDNN inference efficiency, the MIT/IEEE/Amazon GraphChallenge introduced the SpDNN Challenge in 2019. In this paper, we present a hybrid compute model to improve the efficiency of Sparse Matrix Multiplications (SpMMs), the core computation of SpDNN inference. First, the given sparse weight matrix is partitioned into many sparse and dense submatrices. For sparse submatrices, we leverage compile-time data embedding to compile the sparse data, together with their corresponding computations, directly into instructions, significantly reducing the number of random memory accesses. For dense submatrices, we follow the traditional computing mode, in which data is fetched from memory, to exploit the GPU's high memory bandwidth. This hybrid compute model effectively balances the memory and instruction bottlenecks and offers more scheduling opportunities to overlap computation and memory accesses on the GPU. To decide which mode a submatrix should use, we present a cost model that accurately and efficiently estimates its time cost under both the traditional computing mode and the data-embedded computing mode. Once the computing mode for every submatrix is determined, customized code is generated for the SpDNN inference. Experimental results on the SpDNN Challenge benchmarks show that our approach achieves up to 197.86 tera-edges per second of inference throughput on a single NVIDIA A100 GPU. Compared to the 2021 and 2020 champions, our approach offers up to 6.37x and 89.94x speedups on a single GPU, respectively. We also implement a 16-GPU version, achieving up to 9.49x and 80.11x speedups over the 16-GPU implementations of the 2021 and 2020 champions.
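
To make the contrast between the two compute modes concrete, the following minimal CUDA sketch shows one weight row handled each way. This is not the paper's generated code: the kernel names, the example nonzeros (w[0][12] = 0.37, w[0][97] = -0.58), the batch size, and the activation layout are illustrative assumptions; in the paper, a code generator emits a customized kernel per submatrix after the cost model has picked its mode.

#include <cstdio>
#include <cuda_runtime.h>

#define N 4  // batch size (number of input samples); illustrative

// Data-embedded mode: the nonzeros of a sparse submatrix row are compiled
// into the kernel as immediates, so the weights travel in the instruction
// stream instead of being fetched by random global-memory loads.
__global__ void row_embedded(const float* __restrict__ X, float* Y) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;  // sample index
    if (s >= N) return;
    float acc = 0.0f;
    // Emitted per row by a code generator; the nonzeros w[0][12] = 0.37f
    // and w[0][97] = -0.58f here are made-up example values.
    acc += 0.37f * X[12 * N + s];
    acc += -0.58f * X[97 * N + s];
    Y[s] = fmaxf(acc, 0.0f);  // ReLU
}

// Traditional mode: the row is kept in CSR-style arrays and streamed from
// global memory, exploiting the GPU's high memory bandwidth; this wins when
// the submatrix is dense enough for the loads to stream efficiently.
__global__ void row_from_memory(const int* __restrict__ cols,
                                const float* __restrict__ vals, int nnz,
                                const float* __restrict__ X, float* Y) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < nnz; ++k)
        acc += vals[k] * X[cols[k] * N + s];
    Y[s] = fmaxf(acc, 0.0f);
}

int main() {
    const int FEATS = 128;  // input feature count; illustrative
    float hX[FEATS * N], hY[N];
    for (int i = 0; i < FEATS * N; ++i) hX[i] = 1.0f;  // all-ones activations

    float *dX, *dY;
    cudaMalloc(&dX, sizeof(hX));
    cudaMalloc(&dY, sizeof(hY));
    cudaMemcpy(dX, hX, sizeof(hX), cudaMemcpyHostToDevice);

    // Same row, both modes; with all-ones input both compute
    // ReLU(0.37 - 0.58) = 0 for every sample.
    row_embedded<<<1, N>>>(dX, dY);

    int   hCols[2] = {12, 97};
    float hVals[2] = {0.37f, -0.58f};
    int *dCols; float *dVals;
    cudaMalloc(&dCols, sizeof(hCols));
    cudaMalloc(&dVals, sizeof(hVals));
    cudaMemcpy(dCols, hCols, sizeof(hCols), cudaMemcpyHostToDevice);
    cudaMemcpy(dVals, hVals, sizeof(hVals), cudaMemcpyHostToDevice);
    row_from_memory<<<1, N>>>(dCols, dVals, 2, dX, dY);

    cudaMemcpy(hY, dY, sizeof(hY), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hY[0]);

    cudaFree(dX); cudaFree(dY); cudaFree(dCols); cudaFree(dVals);
    return 0;
}

In the embedded kernel, the weight values and column indices are literals in the instruction stream, so the only global loads are the activations; in the memory-mode kernel, both the CSR arrays and the activations are loaded. This is the trade-off the abstract's cost model arbitrates: embedded rows shift pressure from memory bandwidth to instruction issue, so mixing the two modes across submatrices balances both bottlenecks.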