APCNN: Explore Multi-Layer Cooperation for CNN Optimization and Acceleration on FPGA

Beilei Jiang, Xianwei Cheng, Sihai Tang, Xu Ma, Zhaochen Gu, Hui Zhao, Song Fu

The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '21)
Published: 2021-02-17 · DOI: 10.1145/3431920.3439461
Citations: 1
Abstract
In this paper, we introduce APCNN, which explores algorithm-hardware co-design and provides a CNN acceleration framework with multi-layer cooperative optimization and a customized design on FPGA. On the algorithm side, APCNN moves the pooling layer before the non-linear activation function and normalization, which we prove causes negligible accuracy loss; the pooling layer is then co-optimized with the convolutional layer through redundant multiplication elimination, local addition reuse, and global addition reuse. We further design a dedicated accelerator that takes full advantage of this convolutional-pooling cross-layer optimization to not only accelerate computation but also reduce on-/off-chip data communication on the FPGA. We demonstrate that APCNN can achieve a 75% reduction in both multiplications and additions in the best case. For on-/off-chip data communication, a fraction max{Row, Col}/(Row × Col) of the memory footprint can be eliminated, where Row and Col are the number of rows and columns in the activation feature map, respectively. We have implemented a prototype of APCNN and evaluated its performance on LeNet-5 and VGG16 using both an accelerator-level cycle/energy model and an RTL implementation. Our experimental results show that APCNN achieves a 2.5× speedup and 4.7× higher energy efficiency compared with the dense CNN baseline. (This research was supported in part by NSF grants CCF-1563750, OAC-2017564, and CNS-2037982.)
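The reordering described in the abstract is exact (not merely approximate) for the common case of max pooling followed by a monotonically non-decreasing activation such as ReLU, since max commutes with any monotonic function. The sketch below is an illustration of that identity, not code from the paper:

```python
import numpy as np

def relu(x):
    # Element-wise ReLU activation.
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    # Non-overlapping 2x2 max pooling on an (H, W) feature map; H and W even.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 8))  # toy activation feature map

# Conventional order: activate every element, then pool.
conventional = max_pool_2x2(relu(fmap))
# APCNN-style order: pool first, then activate only the pooled outputs.
reordered = relu(max_pool_2x2(fmap))

# Identical results, but the reordered version applies ReLU to 4x fewer values.
assert np.allclose(conventional, reordered)
```

For average pooling or non-monotonic activations the two orders are not strictly equivalent, which is presumably why the paper proves the accuracy loss is negligible rather than zero.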