Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs

2019 IEEE International Symposium on Workload Characterization (IISWC). Pub Date: 2019-08-22. DOI: 10.1109/IISWC47752.2019.9042000

Valentin Radu, Kuba Kaszyk, Yuan Wen, Jack Turner, José Cano, Elliot J. Crowley, Björn Franke, A. Storkey, Michael F. P. O'Boyle


Citations: 32

Abstract

Convolutional Neural Networks (CNNs) are becoming common in many applications and services because of their superior recognition accuracy. They are increasingly used on mobile devices, often simply by porting large models designed for servers, although several model compression techniques have been considered. One compression technique intended to reduce computation is channel pruning. Mobile and embedded systems now have GPUs, which are well suited to the parallel computations of neural networks and offer a lower energy cost per operation. Specialized libraries perform these computations through highly optimized routines. As our experiments show, these libraries are optimized for the most common network shapes, which makes uninstructed channel pruning inefficient. We evaluate higher-level libraries that analyze the input characteristics of a convolutional layer and, based on them, produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. In practice, however, these characteristics and the optimization choices that follow from them can have the opposite effect. We show that reducing the number of convolutional channels by pruning 12% of the initial size is in some cases detrimental to performance, leading to a 2× slowdown. On the other hand, we also find cases where performance-aware pruning achieves the intended result, with speedups of 3× with cuDNN and above 10× with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.
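The idea of performance-aware pruning can be illustrated with a minimal Python sketch. This is not the authors' method or code: the function names (`profile_layer`, `pick_channels`, `toy_latency`) are hypothetical, and `toy_latency` is an invented cost model standing in for a real timing run on a device. It only shows the general pattern of profiling a layer at several channel counts and choosing the fastest one that satisfies a pruning budget, rather than pruning to an arbitrary size.

```python
from typing import Callable, Dict, Iterable


def profile_layer(latency_fn: Callable[[int], float],
                  channel_counts: Iterable[int]) -> Dict[int, float]:
    """Record the latency of one layer at each candidate channel count."""
    return {c: latency_fn(c) for c in channel_counts}


def pick_channels(latencies: Dict[int, float], min_channels: int) -> int:
    """Return the fastest channel count that keeps at least min_channels.

    Ties are broken in favour of keeping more channels.
    """
    candidates = {c: t for c, t in latencies.items() if c >= min_channels}
    if not candidates:
        raise ValueError("no candidate channel count satisfies min_channels")
    return min(candidates, key=lambda c: (candidates[c], -c))


def toy_latency(channels: int) -> float:
    """Invented cost model (an assumption, not measured data).

    Work is rounded up to blocks of 32 channels, and channel counts that
    are not a multiple of 4 pay a fixed penalty, mimicking a library that
    falls back to a slower routine for uncommon shapes.
    """
    blocks = (channels + 31) // 32          # integer ceiling of channels / 32
    penalty = 0.0 if channels % 4 == 0 else 2.0
    return float(blocks) + penalty


if __name__ == "__main__":
    original = 128
    latencies = profile_layer(toy_latency, range(96, original + 1))
    naive = 113                              # roughly 12% fewer channels than 128
    best = pick_channels(latencies, min_channels=96)
    print("original:", original, latencies[original])
    print("naive prune:", naive, latencies[naive])
    print("profile-guided prune:", best, latencies[best])
```

In a real setting, `latency_fn` would wrap an actual timed execution of the layer through the target library on the target device, and the candidate set would also be constrained by the accuracy impact of removing each channel.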


