An FPGA Design Framework for CNN Sparsification and Acceleration

Sicheng Li, W. Wen, Yu Wang, Song Han, Yiran Chen, Hai Helen Li
{"title":"An FPGA Design Framework for CNN Sparsification and Acceleration","authors":"Sicheng Li, W. Wen, Yu Wang, Song Han, Yiran Chen, Hai Helen Li","doi":"10.1109/FCCM.2017.21","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNNs) have recently broken many performance records in image recognition and object detection problems. The success of CNNs, to a great extent, is enabled by the fast scaling-up of the networks that learn from a huge volume of data. The deployment of big CNN models can be both computation-intensive and memory-intensive, leaving severe challenges to hardware implementations. In recent years, sparsification techniques that prune redundant connections in the networks while still retaining the similar accuracy emerge as promising solutions to alliterate the computation overheads associated with CNNs [1]. However, imposing sparsity in CNNs usually generates random network connections and thus, the irregular data access pattern results in poor data locality. The low computation efficiency of the sparse networks, which is caused by the incurred unbalance in computing resource consumption and low memory bandwidth usage, significantly offsets the theocratical reduction of the computation complexity and limits the execution scalability of CNNs on general- purpose architectures [2]. For instance, as an important computation kernel in CNNs – the sparse convoluation, is usually accelerated by using data compression schemes where only nonzero elements of the kernel weights are stored and sent to multiplication-accumulation computations (MACs) at runtime. However, the relevant executions on CPUs and GPUs reach only 0.1% to 10% of the system peak performance even designated software libraries are applied (e.g., MKL library for CPUs and cuSPARSE library for GPUs). Field programmable gate arrays (FPGAs) have been also extensively studied as an important hardware platform for CNN computations [3]. Different from general-purpose architectures, FPGA allows users to customize the functions and organization of the designed hardware in order to adapt various resource needs and data usage patterns. This characteristic, as we identified in this work, can be leveraged to effectively overcome the main challenges in the execution of sparse CNNs through close coordinations between software and hardware. In particular, the reconfigurability of FPGA helps to 1) better map the sparse CNN onto the hardware for improving computation parallelism and execution efficiency and 2) eliminate the computation cost associated with zero weights and enhance data reuse to alleviate the adverse impacts of the irregular data accesses. In this work, we propose a hardware-software co-design framework to address the above challenges in sparse CNN accelerations. First, we introduce a data locality-aware sparsification scheme that optimizes the structure of the sparse CNN during training phase to make it friendly for hardware mapping. Both memory allocation and data access regularization are considered in the optimization process. Second, we develop a distributed architecture composed of the customized processing elements (PEs) that enables high computation parallelism and data reuse rate of the compressed network. Moreover, a holistic sparse optimization is introduced to our design framework for hardware platforms with different requirement. We evaluate our proposed frame- work by executing AlexNet on Xilinx Zynq ZC706. 
Our FPGA accelerator obtains a processing power of 71.2 GOPS, corresponding to 271.6 GOPS on the dense CNN model. On average, our FPGA design runs 11.5× faster than a well- tuned CPU implementation on Intel Xeon E5-2630, and has 3.2× better energy efficiency over the GPU realization on Nvidia Pascal Titan X. Compared to state-of-the-art FPGA designs [4], our accelerator reduces the classification time by 2.1×, with","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2017.21","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

Abstract

Convolutional neural networks (CNNs) have recently broken many performance records in image recognition and object detection. The success of CNNs is enabled, to a great extent, by the fast scaling-up of networks that learn from huge volumes of data. Deploying big CNN models can be both computation-intensive and memory-intensive, posing severe challenges to hardware implementations. In recent years, sparsification techniques that prune redundant connections from a network while retaining similar accuracy have emerged as promising solutions to alleviate the computation overheads associated with CNNs [1]. However, imposing sparsity on CNNs usually generates random network connections, and the resulting irregular data access patterns lead to poor data locality. The low computation efficiency of sparse networks, caused by imbalance in computing resource consumption and low memory bandwidth utilization, significantly offsets the theoretical reduction in computation complexity and limits the execution scalability of CNNs on general-purpose architectures [2]. For instance, sparse convolution, an important computation kernel in CNNs, is usually accelerated with data compression schemes in which only the nonzero elements of the kernel weights are stored and sent to multiply-accumulate computations (MACs) at runtime. However, the corresponding executions on CPUs and GPUs reach only 0.1% to 10% of system peak performance, even when dedicated software libraries are applied (e.g., the MKL library for CPUs and the cuSPARSE library for GPUs).
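As a concrete illustration of the compressed execution just described, the following Python sketch stores only the nonzero weights of a layer in a CSR-style layout and performs the multiply-accumulate over those entries alone. The paper does not specify its exact compression format; the CSR layout, the function names (dense_to_csr, sparse_mac), and the ~25% density in the example are assumptions for illustration only.

```python
import numpy as np

def dense_to_csr(weights, eps=0.0):
    """Compress a 2-D weight matrix, keeping only entries with |w| > eps.

    Returns (values, col_idx, row_ptr): the CSR-style arrays that a
    sparse accelerator would stream to its MAC units at runtime.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in weights:
        nz = np.nonzero(np.abs(row) > eps)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.asarray(values), np.asarray(col_idx), np.asarray(row_ptr)

def sparse_mac(values, col_idx, row_ptr, x):
    """Multiply-accumulate over nonzero weights only: y = W_sparse @ x."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        # The gather x[col_idx[lo:hi]] is the irregular, data-dependent
        # access pattern that causes poor locality on CPUs and GPUs.
        y[i] = values[lo:hi] @ x[col_idx[lo:hi]]
    return y

# Example: a layer pruned to roughly 25% density.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)) * (rng.random((64, 128)) < 0.25)
x = rng.standard_normal(128)
vals, cols, ptr = dense_to_csr(w)
assert np.allclose(sparse_mac(vals, cols, ptr, x), w @ x)
```

Even this correct compressed kernel exhibits the problem the paper targets: the gathers on x are data-dependent, so general-purpose hardware spends most of its time on irregular memory traffic rather than on MACs, consistent with the 0.1% to 10% of peak performance cited above.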
Field-programmable gate arrays (FPGAs) have also been extensively studied as an important hardware platform for CNN computation [3]. Unlike general-purpose architectures, an FPGA allows users to customize the functions and organization of the hardware to fit various resource needs and data usage patterns. As we identify in this work, this characteristic can be leveraged to effectively overcome the main challenges in executing sparse CNNs through close coordination between software and hardware. In particular, the reconfigurability of FPGAs helps to 1) better map the sparse CNN onto the hardware to improve computation parallelism and execution efficiency, and 2) eliminate the computation cost associated with zero weights and enhance data reuse to alleviate the adverse impact of irregular data accesses.

In this work, we propose a hardware-software co-design framework to address the above challenges in sparse CNN acceleration. First, we introduce a data locality-aware sparsification scheme that optimizes the structure of the sparse CNN during the training phase to make it friendly for hardware mapping; a minimal sketch of one such scheme appears after the results below. Both memory allocation and data access regularization are considered in the optimization process. Second, we develop a distributed architecture composed of customized processing elements (PEs) that enables high computation parallelism and a high data reuse rate for the compressed network. Moreover, a holistic sparse optimization is introduced into our design framework to support hardware platforms with different requirements.

We evaluate the proposed framework by executing AlexNet on a Xilinx Zynq ZC706. Our FPGA accelerator attains a processing power of 71.2 GOPS, corresponding to 271.6 GOPS on the dense CNN model. On average, our FPGA design runs 11.5× faster than a well-tuned CPU implementation on an Intel Xeon E5-2630 and achieves 3.2× better energy efficiency than a GPU realization on an Nvidia Pascal Titan X. Compared to state-of-the-art FPGA designs [4], our accelerator reduces the classification time by 2.1×, with
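The data locality-aware sparsification above is described only at a high level. A common hardware-friendly variant, sketched below, prunes weights in fixed-size blocks rather than element by element, so surviving nonzeros stay contiguous and compressed rows map onto PEs with regular accesses and balanced work. This is not necessarily the paper's exact scheme: the magnitude-based criterion, the 4×4 block size, the density target, and the one-shot (rather than in-training) masking are illustrative assumptions.

```python
import numpy as np

def block_prune(w, block=(4, 4), density=0.25):
    """Keep the highest-L1-norm blocks of w and zero out the rest.

    Pruning whole blocks keeps nonzeros contiguous, so the compressed
    rows map onto processing elements with regular memory accesses and
    balanced work, unlike unstructured element-wise pruning.
    """
    rows, cols = w.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0, "shape must tile evenly"
    # L1 norm of each (br x bc) block.
    norms = np.abs(w).reshape(rows // br, br, cols // bc, bc).sum(axis=(1, 3))
    k = max(1, int(round(density * norms.size)))
    # Threshold at the k-th largest block norm; blocks below it are dropped.
    thresh = np.partition(norms.ravel(), -k)[-k]
    mask = (norms >= thresh).repeat(br, axis=0).repeat(bc, axis=1)
    return w * mask

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 128))
w_sparse = block_prune(w)
print("density:", np.count_nonzero(w_sparse) / w_sparse.size)  # ~0.25
```

For scale, the reported numbers imply that the pruned AlexNet executes roughly 71.2 / 271.6 ≈ 26% of the dense model's operations, i.e., sparsification eliminates about three quarters of the MACs while the accelerator sustains the remaining work at 71.2 GOPS.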