Efficient FPGA design for Convolutions in CNN based on FFT-pruning

Liulu He, Xiaoru Xie, Jun Lin, Zhongfeng Wang
{"title":"Efficient FPGA design for Convolutions in CNN based on FFT-pruning","authors":"Liulu He, Xiaoru Xie, Jun Lin, Zhongfeng Wang","doi":"10.1109/APCCAS50809.2020.9301653","DOIUrl":null,"url":null,"abstract":"Fast algorithms of convolution, such as Winograd and fast Fourier transformation (FFT), have been widely used in many FPGA-based CNN accelerators to reducing the complexity of multiplication. The core idea for those fast algorithms is reducing the number of multiplication at the cost of more additions. However, increased additions take up a significant portion in the whole LUT resources in many cases, which forms a new bottleneck in the corresponding hardware design. In this paper, we theoretically analyze the relationship between the reduced multiplications and the increased additions, and propose an reduced complexity fast FFT convolution algorithm by intelligently employing the FFT-pruning method to remove redundant additions. Compared with the state-of-the-art algorithm, our algorithm can reduce more than 50% of additions. Moreover, the proposed algorithm has better numerical accuracy and comparable multiplication complexity compared to the most efficient Winograd algorithm. Additionally, an efficient reconfigurable architecture of the proposed algorithm is also developed to accelerate convolutional layers with various kernel sizes. Implemented with Xilinx ZC706, the proposed architecture achieves 200.6 GOPS on convolutional layers of ResNet-50 with 61% higher resources efficiency with respect to LUT consumption compared to prior arts.","PeriodicalId":127075,"journal":{"name":"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APCCAS50809.2020.9301653","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Fast convolution algorithms, such as Winograd and the fast Fourier transform (FFT), have been widely used in FPGA-based CNN accelerators to reduce multiplication complexity. The core idea of these fast algorithms is to reduce the number of multiplications at the cost of additional additions. However, in many cases the added additions account for a significant portion of the total LUT resources, which becomes a new bottleneck in the corresponding hardware design. In this paper, we theoretically analyze the relationship between the reduced multiplications and the increased additions, and propose a reduced-complexity fast FFT convolution algorithm that employs FFT pruning to remove redundant additions. Compared with the state-of-the-art algorithm, our algorithm reduces the number of additions by more than 50%. Moreover, the proposed algorithm offers better numerical accuracy and comparable multiplication complexity relative to the most efficient Winograd algorithm. Additionally, an efficient reconfigurable architecture for the proposed algorithm is developed to accelerate convolutional layers with various kernel sizes. Implemented on a Xilinx ZC706, the proposed architecture achieves 200.6 GOPS on the convolutional layers of ResNet-50, with 61% higher resource efficiency in terms of LUT consumption compared to prior art.
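The abstract's central point is that FFT-based convolution trades multiplications for additions, and that many of those additions are redundant because the small kernel must be zero-padded up to the transform length. The NumPy sketch below is a minimal 1-D illustration of that idea, not the paper's algorithm or hardware architecture: the function `pruned_fft`, its operand-zero counting, and the chosen tile length of 16 are assumptions made purely for demonstration.

```python
import numpy as np

def pruned_fft(x, stats):
    """Radix-2 DIT FFT that also counts how many butterfly additions have a
    zero operand.  With a zero-padded kernel these additions are pass-throughs,
    which is the redundancy an FFT-pruned hardware datapath removes.
    (Illustrative sketch only, not the paper's method.)"""
    n = len(x)
    if n == 1:
        return x.copy()
    even = pruned_fft(x[0::2], stats)
    odd = pruned_fft(x[1::2], stats)
    t = np.exp(-2j * np.pi * np.arange(n // 2) / n) * odd
    for k in range(n // 2):
        # one complex add and one complex subtract per butterfly
        redundant = (even[k] == 0) or (t[k] == 0)
        stats['prunable' if redundant else 'kept'] += 2
    return np.concatenate([even + t, even - t])

if __name__ == "__main__":
    N = 16                                    # assumed FFT tile length
    kernel = np.array([1.0, 2.0, 3.0])        # 1-D stand-in for a small CNN kernel
    tile = np.random.randn(N - len(kernel) + 1)

    k_pad = np.zeros(N, dtype=complex)
    k_pad[:len(kernel)] = kernel              # 3 nonzero taps, 13 padded zeros
    x_pad = np.zeros(N, dtype=complex)
    x_pad[:len(tile)] = tile

    stats = {'kept': 0, 'prunable': 0}
    K = pruned_fft(k_pad, stats)              # kernel transform: most adds prunable
    X = np.fft.fft(x_pad)
    y = np.fft.ifft(K * X).real               # pointwise product = circular convolution

    # The linear convolution fits in N samples, so it matches the circular result.
    assert np.allclose(y, np.convolve(tile, kernel))
    print(f"kernel FFT adds: {stats['kept']} kept, {stats['prunable']} prunable")
```

Running this reports that the large majority of butterfly additions in the kernel transform have a zero operand; adders of that kind are the kind of redundancy the paper's FFT-pruning approach eliminates in hardware.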