APCNN: Explore Multi-Layer Cooperation for CNN Optimization and Acceleration on FPGA

Beilei Jiang, Xianwei Cheng, Sihai Tang, Xu Ma, Zhaochen Gu, Hui Zhao, Song Fu

The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '21)
Published: 2021-02-17 · DOI: 10.1145/3431920.3439461
Citations: 1
Abstract
In this paper, we introduce APCNN, which explores algorithm-hardware co-design and provides a CNN acceleration framework with multi-layer cooperative optimization and a customized design on FPGA. On the algorithm side, APCNN moves the pooling layer before the non-linear activation function and normalization, which we prove causes negligible accuracy loss; the pooling layer is then co-optimized with the convolutional layer through redundant multiplication elimination, local addition reuse, and global addition reuse. We further design a dedicated accelerator that takes full advantage of this convolutional-pooling cross-layer optimization to not only accelerate computation but also reduce on-/off-chip data communication on the FPGA. We demonstrate that APCNN can achieve a 75% reduction in both multiplications and additions in the best case. For on-/off-chip data communication, a fraction max{Row, Col}/(Row × Col) of the memory footprint can be eliminated, where Row and Col are the number of rows and columns in the activation feature map, respectively. We have implemented a prototype of APCNN and evaluated its performance on LeNet-5 and VGG16 using both an accelerator-level cycle/energy model and an RTL implementation. Our experimental results show that APCNN achieves a 2.5× speedup and 4.7× higher energy efficiency compared with the dense CNN baseline. (This research was supported in part by NSF grants CCF-1563750, OAC-2017564, and CNS-2037982.)
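The reordering described in the abstract is exact (not merely approximate) for the common case of max pooling followed by a monotonically non-decreasing activation such as ReLU, since max commutes with any monotonic function. The sketch below is an illustration of that identity, not code from the paper:

```python
import numpy as np

def relu(x):
    # Element-wise ReLU activation.
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    # Non-overlapping 2x2 max pooling on an (H, W) feature map; H and W even.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 8))  # toy activation feature map

# Conventional order: activate every element, then pool.
conventional = max_pool_2x2(relu(fmap))
# APCNN-style order: pool first, then activate only the pooled outputs.
reordered = relu(max_pool_2x2(fmap))

# Identical results, but the reordered version applies ReLU to 4x fewer values.
assert np.allclose(conventional, reordered)
```

For average pooling or non-monotonic activations the two orders are not strictly equivalent, which is presumably why the paper proves the accuracy loss is negligible rather than zero.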