APCNN: Explore Multi-Layer Cooperation for CNN Optimization and Acceleration on FPGA

Beilei Jiang, Xianwei Cheng, Sihai Tang, Xu Ma, Zhaochen Gu, Hui Zhao, Song Fu
Published in: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2021-02-17
DOI: 10.1145/3431920.3439461 (https://doi.org/10.1145/3431920.3439461)
Citations: 1

Abstract

In this paper, we introduce APCNN, a CNN acceleration framework that explores algorithm-hardware co-design through multi-layer cooperative optimization and a customized FPGA design. On the algorithm side, APCNN moves the pooling layer ahead of the non-linear activation function and normalization, which we prove causes negligible accuracy loss; the pooling layer is then co-optimized with the convolutional layer by means of redundant multiplication elimination, local addition reuse, and global addition reuse. We further design a dedicated accelerator that takes full advantage of this convolution-pooling cross-layer optimization, not only to accelerate computation but also to reduce data communication between on-chip and off-chip memory on the FPGA. We demonstrate that APCNN can eliminate 75% of multiplications and 75% of additions in the best case. For on-chip/off-chip data communication, a max{Row, Col}/(Row × Col) percent of the memory footprint can be eliminated, where Row and Col are the numbers of rows and columns in the activation feature map, respectively. We have implemented a prototype of APCNN and evaluated it on LeNet-5 and VGG16 using both an accelerator-level cycle and energy model and an RTL implementation. Our experimental results show that APCNN achieves a 2.5× speedup and a 4.7× improvement in energy efficiency compared with the dense CNN. (This research was supported in part by NSF grants CCF-1563750, OAC-2017564, and CNS-2037982.)
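To make the best-case 75% multiplication figure concrete, here is a minimal sketch of redundant multiplication elimination. It is not the authors' implementation and makes assumptions the abstract does not state: a single channel, stride-1 "valid" convolution, and non-overlapping 2×2 average pooling applied directly to the convolution output (that is, before any activation). Under those assumptions pooling is linear, so the four inputs that share a weight can be added first and multiplied once. All function names are hypothetical.

```python
import random

def conv2d_valid(x, w):
    """Dense stride-1 'valid' convolution; returns (output, multiplication count)."""
    k = len(w)
    out_r, out_c = len(x) - k + 1, len(x[0]) - k + 1
    out = [[0.0] * out_c for _ in range(out_r)]
    mults = 0
    for i in range(out_r):
        for j in range(out_c):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += w[a][b] * x[i + a][j + b]
                    mults += 1
            out[i][j] = acc
    return out, mults

def avg_pool_2x2(y):
    """Non-overlapping 2x2 average pooling (assumed pooling type)."""
    return [[(y[i][j] + y[i][j + 1] + y[i + 1][j] + y[i + 1][j + 1]) / 4.0
             for j in range(0, len(y[0]) - 1, 2)]
            for i in range(0, len(y) - 1, 2)]

def fused_conv_pool(x, w):
    """Pre-add the four inputs that share each weight, then multiply once.

    The final division by 4 is a constant scaling (a shift in fixed-point
    hardware) and is not counted as a multiplication here.
    """
    k = len(w)
    out_r, out_c = (len(x) - k + 1) // 2, (len(x[0]) - k + 1) // 2
    out = [[0.0] * out_c for _ in range(out_r)]
    mults = 0
    for i in range(out_r):
        for j in range(out_c):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    r, c = 2 * i + a, 2 * j + b
                    s = x[r][c] + x[r][c + 1] + x[r + 1][c] + x[r + 1][c + 1]
                    acc += w[a][b] * s
                    mults += 1
            out[i][j] = acc / 4.0
    return out, mults

random.seed(0)
x = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(6)]
w = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

dense, dense_mults = conv2d_valid(x, w)
reference = avg_pool_2x2(dense)
fused, fused_mults = fused_conv_pool(x, w)
print(dense_mults, fused_mults)  # 144 36
```

For a 6×6 input and a 3×3 kernel, the dense path performs 144 multiplications and the fused path 36, a 75% reduction with matching outputs up to floating-point rounding. The sketch does not model the local and global addition reuse mentioned in the abstract: it recomputes each four-input sum from scratch, so it does not by itself reduce the number of additions.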