Accelerate CNN Models via Filter Pruning and Sparse Tensor Core
Xurong Chen, Pangfeng Liu, Ding-Yong Hong, Jan-Jan Wu
2021 Ninth International Symposium on Computing and Networking (CANDAR), November 2021
DOI: 10.1109/CANDAR53791.2021.00009
Citations: 0
Abstract
Convolutional neural networks (CNNs) are a state-of-the-art technique in machine learning and have achieved high accuracy in many computer vision tasks. However, the number of model parameters is growing rapidly in pursuit of accuracy, which increases the computation time and memory required for training and inference. Compressing the model and improving inference speed have therefore become important issues. This paper focuses on filter pruning and the NVIDIA sparse tensor core. Filter pruning is a model compression method that evaluates the importance of the filters in a CNN model and removes the less important ones. The sparse tensor core is hardware support provided by the NVIDIA Ampere GPU architecture; it can speed up matrix multiplication when the matrix follows a 2:4 sparsity pattern. In this paper, we propose a hybrid pruning method that combines filter pruning and 2:4 pruning. We first apply filter pruning to remove redundant filters from the convolutional layers, making the model smaller. We then apply 2:4 pruning so that the model matches the 2:4 pattern and can exploit the sparse tensor core for speedup. For this hybrid setting, we also propose a hybrid ranking metric to decide each filter's importance during filter pruning: it preserves the filters that are important for both pruning steps. By considering both criteria, we achieve higher accuracy than traditional filter pruning. We test our hybrid pruning algorithm on the MNIST, SVHN, and CIFAR-10 datasets using AlexNet. Our experiments show that the hybrid ranking metric achieves better accuracy than the classic L1-norm metric and the output L1-norm metric.
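To make the 2:4 pattern concrete, here is a minimal pure-Python sketch (an illustration, not the paper's code): in every group of four consecutive weights, the two with the largest magnitude are kept and the other two are zeroed, which is the structure that Ampere sparse tensor cores accelerate.

```python
def prune_2_4(row):
    """Apply a 2:4 sparsity mask to one weight row: in every group of
    four consecutive weights, keep the two largest-magnitude values
    and set the other two to zero."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    pruned = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # indices of the two largest-magnitude weights in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
print(prune_2_4(weights))  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.8, 0.0]
```

Because exactly two of every four values survive, the hardware can store the row in compressed form and skip the zeroed multiplications.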
When we prune away 40 percent of the filters in the model, our method achieves 2.8%, 2.9%, and 2.7% higher accuracy than the classic L1-norm and output L1-norm metrics on the three datasets, respectively. We also evaluate inference speed by comparing the hybrid pruning model with models produced by filter pruning alone or 2:4 pruning alone. We find that the hybrid pruning model can be 1.3x faster than the filter pruning model at similar accuracy.
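The classic L1-norm baseline that the results above compare against can be sketched as follows; this is an illustrative reimplementation under simplified assumptions (1-D toy filters), not the authors' code. Each filter is scored by the sum of the absolute values of its weights, and the lowest-scoring filters are pruned first.

```python
def rank_filters_by_l1(filters):
    """Classic L1-norm filter ranking: score each convolutional filter
    by the sum of absolute values of its (flattened) weights, and return
    filter indices ordered from least to most important. Pruning removes
    filters from the front of this list."""
    scores = [sum(abs(w) for w in f) for f in filters]
    return sorted(range(len(filters)), key=lambda i: scores[i])

# Three toy 1-D "filters"; the middle one has the smallest L1 norm.
filters = [[0.5, -0.4, 0.3], [0.05, 0.02, -0.01], [0.9, 0.3, -0.2]]
print(rank_filters_by_l1(filters))  # [1, 0, 2]
```

A hybrid metric in the spirit of the paper would additionally account for how much of each filter's weight survives the subsequent 2:4 pruning step, but the exact formulation is the paper's contribution and is not reproduced here.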