{"title":"An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs","authors":"Liqiang Lu, Jiaming Xie, Ruirui Huang, Jiansong Zhang, Wei Lin, Yun Liang","doi":"10.1109/FCCM.2019.00013","DOIUrl":null,"url":null,"abstract":"Deep convolutional neural networks (CNN) have achieved remarkable performance with the cost of huge computation. As the CNN model becomes more complex and deeper, compressing CNN to sparse by pruning the redundant connection in networks has emerged as an attractive approach to reduce the amount of computation and memory requirement. In recent years, FPGAs have been demonstrated to be an effective hardware platform to accelerate CNN inference. However, most existing FPGA architectures focus on dense CNN models. The architecture designed for dense CNN models are inefficient when executing sparse models as most of the arithmetic operations involve addition and multiplication with zero operands. On the other hand, recent sparse FPGA accelerators only focus on FC layers. In this work, we aim to develop an FPGA accelerator for sparse CNNs. To efficiently deal with the irregular connection in the sparse convolutional layer, we propose a weight-oriented dataflow that processes each weight individually. Then we design an FPGA architecture which can handle input-weight connection and weight-output connection efficiently. For input-weight connection, we design a tile look-up table to eliminate the runtime indexing match of compressed weights. Moreover, we develop a weight layout to enable high on-chip memory access. To cooperate with the weight layout, a channel multiplexer is inserted to locate the address which can ensure no data access conflict. Experiments demonstrate that our accelerator can achieve 223.4-309.0 GOP/s for the modern CNNs on Xilinx ZCU102, which provides a 3.6x-12.9x speedup over previous dense CNN FPGA accelerators.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"88","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2019.00013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 88
Abstract
Deep convolutional neural networks (CNNs) have achieved remarkable performance at the cost of huge computation. As CNN models become deeper and more complex, compressing them into sparse models by pruning redundant connections has emerged as an attractive approach to reducing both computation and memory requirements. In recent years, FPGAs have been demonstrated to be an effective hardware platform for accelerating CNN inference. However, most existing FPGA architectures target dense CNN models, and architectures designed for dense models are inefficient when executing sparse ones, as most of their arithmetic operations are additions and multiplications with zero operands. On the other hand, recent sparse FPGA accelerators focus only on fully-connected (FC) layers. In this work, we aim to develop an FPGA accelerator for sparse CNNs. To efficiently handle the irregular connections in sparse convolutional layers, we propose a weight-oriented dataflow that processes each weight individually. We then design an FPGA architecture that handles both input-weight and weight-output connections efficiently. For the input-weight connection, we design a tile look-up table that eliminates runtime index matching of the compressed weights. Moreover, we develop a weight layout that enables efficient on-chip memory access; to cooperate with this layout, a channel multiplexer is inserted to locate the correct address and guarantee conflict-free data access. Experiments demonstrate that our accelerator achieves 223.4-309.0 GOP/s on modern CNNs on a Xilinx ZCU102, a 3.6x-12.9x speedup over previous dense CNN FPGA accelerators.
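To make the weight-oriented dataflow concrete, the following is a minimal NumPy sketch of the idea as described in the abstract: each nonzero weight is processed individually and reused across an entire output feature map, so zero weights never enter the compute pipeline. This is an illustrative software interpretation, not the paper's hardware implementation; all names (compress, sparse_conv2d) and the compressed-weight tuple format are assumptions for the sake of the example.

```python
import numpy as np

def compress(kernel):
    """Return the nonzero weights of a 4-D kernel (OC, IC, KH, KW) as
    (value, oc, ic, kh, kw) tuples -- a hypothetical stand-in for a
    compressed sparse weight format."""
    oc, ic, kh, kw = np.nonzero(kernel)
    return list(zip(kernel[oc, ic, kh, kw], oc, ic, kh, kw))

def sparse_conv2d(ifm, kernel):
    """Valid (no-padding, stride-1) convolution driven by nonzero weights.
    ifm: (IC, H, W), kernel: (OC, IC, KH, KW)."""
    OC, IC, KH, KW = kernel.shape
    _, H, W = ifm.shape
    OH, OW = H - KH + 1, W - KW + 1
    ofm = np.zeros((OC, OH, OW))
    for w, oc, ic, kh, kw in compress(kernel):
        # One weight multiplies an entire shifted window of the input:
        # this is the "process each weight individually" dataflow, with
        # the multiply-accumulate skipped entirely for pruned weights.
        ofm[oc] += w * ifm[ic, kh:kh + OH, kw:kw + OW]
    return ofm

# Quick check against a dense reference on a randomly pruned kernel.
rng = np.random.default_rng(0)
k = rng.standard_normal((4, 3, 3, 3))
k[rng.random(k.shape) < 0.7] = 0.0          # ~70% sparsity, as after pruning
x = rng.standard_normal((3, 8, 8))

dense = np.zeros((4, 6, 6))
for oc in range(4):
    for ic in range(3):
        for i in range(6):
            for j in range(6):
                dense[oc, i, j] += np.sum(k[oc, ic] * x[ic, i:i + 3, j:j + 3])

assert np.allclose(sparse_conv2d(x, k), dense)
```

The sketch also suggests why the paper's tile look-up table matters: in software, the mapping from a nonzero weight to its input window is computed on the fly from the stored indices, whereas the accelerator precomputes that input-weight matching per tile to avoid runtime index comparison in hardware.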