A Framework for Generating High Throughput CNN Implementations on FPGAs

Hanqing Zeng, Ren Chen, Chi Zhang, V. Prasanna
{"title":"A Framework for Generating High Throughput CNN Implementations on FPGAs","authors":"Hanqing Zeng, Ren Chen, Chi Zhang, V. Prasanna","doi":"10.1145/3174243.3174265","DOIUrl":null,"url":null,"abstract":"We propose a framework to generate highly efficient accelerators for inferencing on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic \\textttVerilog generation. The algorithmic optimizations improve throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique has been known, it performs \"wasted\" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves OaA significantly by reducing the \"wasted\" computation on the padded pixels. The proposed CaP used in conjunction with OaA enables us to choose a fixed FFT size at design time, and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device by fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth into a device coefficient. We identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable \\textttVerilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on Intel HARP heterogeneous platform. Using our framework, we achieve throughput of $780.6$ $GOPS$, $669.1$ $GOPS$ and $552.1$ $GOPS$ for AlexNet, VGG16 and FCN-16s respectively. These correspond to $6.8\\times$ (AlexNet) and $4.9\\times$ (VGG16) improvement compared with the state-of-the-art implementations.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"76","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3174243.3174265","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 76

Abstract

We propose a framework to generate highly efficient accelerators for inferencing on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic \textttVerilog generation. The algorithmic optimizations improve throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique has been known, it performs "wasted" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves OaA significantly by reducing the "wasted" computation on the padded pixels. The proposed CaP used in conjunction with OaA enables us to choose a fixed FFT size at design time, and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device by fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth into a device coefficient. We identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable \textttVerilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on Intel HARP heterogeneous platform. Using our framework, we achieve throughput of $780.6$ $GOPS$, $669.1$ $GOPS$ and $552.1$ $GOPS$ for AlexNet, VGG16 and FCN-16s respectively. These correspond to $6.8\times$ (AlexNet) and $4.9\times$ (VGG16) improvement compared with the state-of-the-art implementations.
在fpga上生成高吞吐量CNN实现的框架
我们提出了一个框架来生成高效的fpga推理加速器。我们的框架由多个算法优化组成,以减少计算复杂性和通信量,有效利用资源的映射方法,以及自动生成\textttVerilog的工具。算法优化提高了频域卷积的吞吐量,以满足给定的硬件约束。虽然重叠和添加(OaA)技术已经为人所知,但它在边缘执行“浪费”的计算。我们提出了一种新的连接和填充(CaP)技术,该技术通过减少填充像素上的“浪费”计算来显着提高OaA。与OaA结合使用的CaP使我们能够在设计时选择固定的FFT大小,并且对于不同图像大小和内核窗口大小的层实现较低的计算复杂度。我们还开发了一种新的频域环路平铺技术,通过改善数据重用来进一步提高吞吐量。我们的映射方法通过快速的设计空间探索来优化目标设备的架构。我们通过捕获它们的DSP资源、片上存储器大小和外部存储器带宽来定量地对fpga进行分类。我们根据计算和通信成本之间的权衡来确定最优的体系结构参数。我们的框架包括一个自动生成完全可合成的\ texttverilog的工具。我们通过在Intel HARP异构平台上为最先进的CNN模型生成高吞吐量加速器来演示该框架。使用我们的框架,我们分别为AlexNet, VGG16和fcn -16实现了$780.6$ $GOPS$, $669.1$ $GOPS$和$552.1$ $GOPS$。与最先进的实现相比,这相当于6.8美元(AlexNet)和4.9美元(VGG16)的改进。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信