Efficient FPGA design for Convolutions in CNN based on FFT-pruning

Liulu He, Xiaoru Xie, Jun Lin, Zhongfeng Wang
{"title":"Efficient FPGA design for Convolutions in CNN based on FFT-pruning","authors":"Liulu He, Xiaoru Xie, Jun Lin, Zhongfeng Wang","doi":"10.1109/APCCAS50809.2020.9301653","DOIUrl":null,"url":null,"abstract":"Fast algorithms of convolution, such as Winograd and fast Fourier transformation (FFT), have been widely used in many FPGA-based CNN accelerators to reducing the complexity of multiplication. The core idea for those fast algorithms is reducing the number of multiplication at the cost of more additions. However, increased additions take up a significant portion in the whole LUT resources in many cases, which forms a new bottleneck in the corresponding hardware design. In this paper, we theoretically analyze the relationship between the reduced multiplications and the increased additions, and propose an reduced complexity fast FFT convolution algorithm by intelligently employing the FFT-pruning method to remove redundant additions. Compared with the state-of-the-art algorithm, our algorithm can reduce more than 50% of additions. Moreover, the proposed algorithm has better numerical accuracy and comparable multiplication complexity compared to the most efficient Winograd algorithm. Additionally, an efficient reconfigurable architecture of the proposed algorithm is also developed to accelerate convolutional layers with various kernel sizes. Implemented with Xilinx ZC706, the proposed architecture achieves 200.6 GOPS on convolutional layers of ResNet-50 with 61% higher resources efficiency with respect to LUT consumption compared to prior arts.","PeriodicalId":127075,"journal":{"name":"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APCCAS50809.2020.9301653","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Fast convolution algorithms, such as Winograd and the fast Fourier transform (FFT), have been widely used in FPGA-based CNN accelerators to reduce multiplication complexity. The core idea of these fast algorithms is to reduce the number of multiplications at the cost of additional additions. However, in many cases the added additions account for a significant portion of the total LUT resources, which becomes a new bottleneck in the corresponding hardware design. In this paper, we theoretically analyze the relationship between the reduced multiplications and the increased additions, and propose a reduced-complexity fast FFT convolution algorithm that employs FFT pruning to remove redundant additions. Compared with the state-of-the-art algorithm, our algorithm reduces the number of additions by more than 50%. Moreover, the proposed algorithm offers better numerical accuracy and comparable multiplication complexity relative to the most efficient Winograd algorithm. Additionally, an efficient reconfigurable architecture for the proposed algorithm is developed to accelerate convolutional layers with various kernel sizes. Implemented on a Xilinx ZC706, the proposed architecture achieves 200.6 GOPS on the convolutional layers of ResNet-50, with 61% higher resource efficiency in terms of LUT consumption compared to prior art.
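The abstract's central point is that FFT-based convolution trades multiplications for additions, and that many of those additions are redundant because the small kernel must be zero-padded up to the transform length. The NumPy sketch below is a minimal 1-D illustration of that idea, not the paper's algorithm or hardware architecture: the function `pruned_fft`, its operand-zero counting, and the chosen tile length of 16 are assumptions made purely for demonstration.

```python
import numpy as np

def pruned_fft(x, stats):
    """Radix-2 DIT FFT that also counts how many butterfly additions have a
    zero operand.  With a zero-padded kernel these additions are pass-throughs,
    which is the redundancy an FFT-pruned hardware datapath removes.
    (Illustrative sketch only, not the paper's method.)"""
    n = len(x)
    if n == 1:
        return x.copy()
    even = pruned_fft(x[0::2], stats)
    odd = pruned_fft(x[1::2], stats)
    t = np.exp(-2j * np.pi * np.arange(n // 2) / n) * odd
    for k in range(n // 2):
        # one complex add and one complex subtract per butterfly
        redundant = (even[k] == 0) or (t[k] == 0)
        stats['prunable' if redundant else 'kept'] += 2
    return np.concatenate([even + t, even - t])

if __name__ == "__main__":
    N = 16                                    # assumed FFT tile length
    kernel = np.array([1.0, 2.0, 3.0])        # 1-D stand-in for a small CNN kernel
    tile = np.random.randn(N - len(kernel) + 1)

    k_pad = np.zeros(N, dtype=complex)
    k_pad[:len(kernel)] = kernel              # 3 nonzero taps, 13 padded zeros
    x_pad = np.zeros(N, dtype=complex)
    x_pad[:len(tile)] = tile

    stats = {'kept': 0, 'prunable': 0}
    K = pruned_fft(k_pad, stats)              # kernel transform: most adds prunable
    X = np.fft.fft(x_pad)
    y = np.fft.ifft(K * X).real               # pointwise product = circular convolution

    # The linear convolution fits in N samples, so it matches the circular result.
    assert np.allclose(y, np.convolve(tile, kernel))
    print(f"kernel FFT adds: {stats['kept']} kept, {stats['prunable']} prunable")
```

Running this reports that the large majority of butterfly additions in the kernel transform have a zero operand; adders of that kind are the kind of redundancy the paper's FFT-pruning approach eliminates in hardware.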