在fpga上生成高吞吐量CNN实现的框架

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2018-02-15 DOI:10.1145/3174243.3174265

Hanqing Zeng, Ren Chen, Chi Zhang, V. Prasanna

{"title":"在fpga上生成高吞吐量CNN实现的框架","authors":"Hanqing Zeng, Ren Chen, Chi Zhang, V. Prasanna","doi":"10.1145/3174243.3174265","DOIUrl":null,"url":null,"abstract":"We propose a framework to generate highly efficient accelerators for inferencing on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic \\textttVerilog generation. The algorithmic optimizations improve throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique has been known, it performs \"wasted\" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves OaA significantly by reducing the \"wasted\" computation on the padded pixels. The proposed CaP used in conjunction with OaA enables us to choose a fixed FFT size at design time, and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device by fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth into a device coefficient. We identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable \\textttVerilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on Intel HARP heterogeneous platform. Using our framework, we achieve throughput of $780.6$ $GOPS$, $669.1$ $GOPS$ and $552.1$ $GOPS$ for AlexNet, VGG16 and FCN-16s respectively. These correspond to $6.8\\times$ (AlexNet) and $4.9\\times$ (VGG16) improvement compared with the state-of-the-art implementations.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"76","resultStr":"{\"title\":\"A Framework for Generating High Throughput CNN Implementations on FPGAs\",\"authors\":\"Hanqing Zeng, Ren Chen, Chi Zhang, V. Prasanna\",\"doi\":\"10.1145/3174243.3174265\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We propose a framework to generate highly efficient accelerators for inferencing on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic \\\\textttVerilog generation. The algorithmic optimizations improve throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique has been known, it performs \\\"wasted\\\" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves OaA significantly by reducing the \\\"wasted\\\" computation on the padded pixels. The proposed CaP used in conjunction with OaA enables us to choose a fixed FFT size at design time, and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device by fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth into a device coefficient. We identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable \\\\textttVerilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on Intel HARP heterogeneous platform. Using our framework, we achieve throughput of $780.6$ $GOPS$, $669.1$ $GOPS$ and $552.1$ $GOPS$ for AlexNet, VGG16 and FCN-16s respectively. These correspond to $6.8\\\\times$ (AlexNet) and $4.9\\\\times$ (VGG16) improvement compared with the state-of-the-art implementations.\",\"PeriodicalId\":164936,\"journal\":{\"name\":\"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-02-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"76\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3174243.3174265\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3174243.3174265","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 76

摘要

我们提出了一个框架来生成高效的fpga推理加速器。我们的框架由多个算法优化组成，以减少计算复杂性和通信量，有效利用资源的映射方法，以及自动生成\textttVerilog的工具。算法优化提高了频域卷积的吞吐量，以满足给定的硬件约束。虽然重叠和添加(OaA)技术已经为人所知，但它在边缘执行“浪费”的计算。我们提出了一种新的连接和填充(CaP)技术，该技术通过减少填充像素上的“浪费”计算来显着提高OaA。与OaA结合使用的CaP使我们能够在设计时选择固定的FFT大小，并且对于不同图像大小和内核窗口大小的层实现较低的计算复杂度。我们还开发了一种新的频域环路平铺技术，通过改善数据重用来进一步提高吞吐量。我们的映射方法通过快速的设计空间探索来优化目标设备的架构。我们通过捕获它们的DSP资源、片上存储器大小和外部存储器带宽来定量地对fpga进行分类。我们根据计算和通信成本之间的权衡来确定最优的体系结构参数。我们的框架包括一个自动生成完全可合成的\ texttverilog的工具。我们通过在Intel HARP异构平台上为最先进的CNN模型生成高吞吐量加速器来演示该框架。使用我们的框架，我们分别为AlexNet, VGG16和fcn -16实现了$780.6$ $GOPS$， $669.1$ $GOPS$和$552.1$ $GOPS$。与最先进的实现相比，这相当于6.8美元(AlexNet)和4.9美元(VGG16)的改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Framework for Generating High Throughput CNN Implementations on FPGAs

We propose a framework to generate highly efficient accelerators for inferencing on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic \textttVerilog generation. The algorithmic optimizations improve throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique has been known, it performs "wasted" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves OaA significantly by reducing the "wasted" computation on the padded pixels. The proposed CaP used in conjunction with OaA enables us to choose a fixed FFT size at design time, and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device by fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth into a device coefficient. We identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable \textttVerilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on Intel HARP heterogeneous platform. Using our framework, we achieve throughput of $780.6$ $GOPS$, $669.1$ $GOPS$ and $552.1$ $GOPS$ for AlexNet, VGG16 and FCN-16s respectively. These correspond to $6.8\times$ (AlexNet) and $4.9\times$ (VGG16) improvement compared with the state-of-the-art implementations.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

自引率

0.00%

发文量