A Cost-Effective CNN Accelerator Design with Configurable PU on FPGA

Chi Fung Brian Fong, Jiandong Mu, Wei Zhang
{"title":"A Cost-Effective CNN Accelerator Design with Configurable PU on FPGA","authors":"Chi Fung Brian Fong, Jiandong Mu, Wei Zhang","doi":"10.1109/ISVLSI.2019.00015","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNNs) are rapidly expanding and being applied to a vast range of applications. Despite its popularity, deploying CNNs on a portable system is challenging due to enormous data volume, intensive computation, and frequent memory access. Hence, many approaches have been proposed to reduce the CNN model complexity, such as model pruning and quantization. However, it also brings new challenges. For example, existing designs usually adopted channel dimension tiling which requires regular channel number. After pruning, the channel number may become highly irregular which will incur heavy zero padding and large resource waste. As for quantization, simple aggressive bit reduction usually results in large accuracy drop. In order to address these challenges, in this work, firstly we propose to use row-based tiling in the kernel dimension to adapt to different kernel sizes and channel numbers and significantly reduce the zero padding. Moreover, we developed the configurable processing units (PUs) design which can be dynamically grouped or split to support the tiling flexibility and enable efficient hardware resource sharing. As for quantization, we considered the recently proposed Incremental Network Quantization (INQ) algorithm which uses low bit representation of weights in power of 2 format, and hence is able to represent the weights with minimum computing complexity since expensive multiplication can be transferred into cheap shift operation. We further propose an approximate shifter based processing element (PE) design as the fundamental building block of the PUs to facilitate the convolution computation. At last, a case study of RTL-level implementation of INQ quantized AlexNet is realized on a standalone FPGA, Stratix V. Compared with the state-of-art designs, our accelerator achieves 1.87x higher performance, which demonstrates the efficiency of the proposed design methods.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"23 1","pages":"31-36"},"PeriodicalIF":0.0000,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISVLSI.2019.00015","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Convolutional neural networks (CNNs) are rapidly expanding and being applied to a vast range of applications. Despite their popularity, deploying CNNs on portable systems is challenging due to the enormous data volume, intensive computation, and frequent memory accesses involved. Hence, many approaches have been proposed to reduce CNN model complexity, such as model pruning and quantization. However, these approaches also bring new challenges. For example, existing designs usually adopt tiling along the channel dimension, which requires regular channel numbers; after pruning, channel numbers may become highly irregular, incurring heavy zero padding and large resource waste. As for quantization, simple aggressive bit reduction usually results in a large accuracy drop. To address these challenges, we first propose row-based tiling in the kernel dimension, which adapts to different kernel sizes and channel numbers and significantly reduces zero padding. Moreover, we develop a configurable processing unit (PU) design that can be dynamically grouped or split to support this tiling flexibility and enable efficient hardware resource sharing. For quantization, we adopt the recently proposed Incremental Network Quantization (INQ) algorithm, which represents weights with low bit widths in power-of-2 format and hence minimizes computational complexity, since expensive multiplications can be converted into cheap shift operations. We further propose an approximate-shifter-based processing element (PE) design as the fundamental building block of the PUs to facilitate the convolution computation. Finally, as a case study, an RTL-level implementation of an INQ-quantized AlexNet is realized on a standalone Stratix V FPGA. Compared with state-of-the-art designs, our accelerator achieves 1.87x higher performance, which demonstrates the efficiency of the proposed design methods.
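To make the zero-padding argument concrete, here is a small illustrative model (our own simplification, not the paper's exact tiling scheme) of why fixed channel-dimension tiling wastes compute on pruned, irregular channel counts, while tiling over a flattened kernel dimension pads far less. The layer shape and tile size below are hypothetical.

```python
def pad_to(n, tile):
    """Zero-padded slots needed to round n up to a multiple of tile."""
    return (-n) % tile

# Hypothetical pruned layer: 3x3 kernels, 37 surviving input channels, tile = 16.
channels, k, tile = 37, 3, 16

# Channel-dimension tiling: 37 channels pad up to 48 at every kernel position.
print(k * k * pad_to(channels, tile))   # 9 * 11 = 99 wasted slots per kernel

# Row-based tiling over the flattened kernel (k*k*channels elements) pads once.
print(pad_to(k * k * channels, tile))   # 333 -> 336, only 3 wasted slots
```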
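The arithmetic payoff of INQ's power-of-2 weights is that every multiply in a convolution collapses into a shift plus a sign flip. The following is a minimal Python sketch of that idea; the function names, fixed-point format, and exponent range are illustrative assumptions, not the paper's PE design.

```python
import math

def inq_quantize(w, exp_min=-6, exp_max=0):
    """Map a real weight to sign * 2^e (or zero), in the spirit of INQ.

    The exponent bounds here are illustrative assumptions, not values
    taken from the paper.
    """
    if w == 0.0:
        return 0, 0                       # (sign, exp); sign == 0 encodes a zero weight
    sign = 1 if w > 0 else -1
    e = max(exp_min, min(exp_max, round(math.log2(abs(w)))))
    return sign, e

def shift_mac(acc, x_fx, sign, e):
    """One multiply-accumulate in which the multiply is just a shift.

    x_fx is a fixed-point activation. Multiplying by 2^e is a left shift
    for e >= 0 and an arithmetic right shift for e < 0; the right shift
    truncates low bits, which is the 'approximate' aspect of a
    shifter-based PE.
    """
    if sign == 0:
        return acc                        # pruned/zero weight contributes nothing
    p = x_fx << e if e >= 0 else x_fx >> -e
    return acc + (p if sign > 0 else -p)

# Toy tap: activation 1.5 in Q8 fixed point, weight 0.27 quantized to +2^-2.
x_fx = int(1.5 * (1 << 8))                # 384
sign, e = inq_quantize(0.27)              # (+1, -2), i.e. 0.25
acc = shift_mac(0, x_fx, sign, e)
print(acc / (1 << 8))                     # 0.375 == 1.5 * 0.25
```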