A Block-Floating-Point Arithmetic Based FPGA Accelerator for Convolutional Neural Networks

H. Zhang, Zhenyu Liu, Guanwen Zhang, Jiwu Dai, Xiaocong Lian, W. Zhou, Xiangyang Ji
{"title":"A Block-Floating-Point Arithmetic Based FPGA Accelerator for Convolutional Neural Networks","authors":"H. Zhang, Zhenyu Liu, Guanwen Zhang, Jiwu Dai, Xiaocong Lian, W. Zhou, Xiangyang Ji","doi":"10.1109/GlobalSIP45357.2019.8969292","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNNs) have been widely used in computer vision applications and achieved great success. However, large-scale CNN models usually consume a lot of computing and memory resources, which makes it difficult for them to be deployed on embedded devices. An efficient block-floating-point (BFP) arithmetic is proposed in this paper. compared with 32-bit floating-point arithmetic, the memory and off-chip bandwidth requirements during convolution are reduced by 50% and 72.37%, respectively. Due to the adoption of BFP arithmetic, the complex multiplication and addition operations of floating-point numbers can be replaced by the corresponding operations of fixed-point numbers, which is more efficient on hardware. A CNN model can be deployed on our accelerator with no more than 0.14% top-1 accuracy loss, and there is no need for retraining and fine-tuning. By employing a series of ping-pong memory access schemes, 2-dimensional propagate partial multiply-accumulate (PPMAC) processors, and an optimized memory system, we implemented a CNN accelerator on Xilinx VC709 evaluation board. 
The accelerator achieves a performance of 665.54 GOP/s and a power efficiency of 89.7 GOP/s/W under a 300 MHz working frequency, which outperforms previous FPGA based accelerators significantly.","PeriodicalId":221378,"journal":{"name":"2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP)","volume":"254 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/GlobalSIP45357.2019.8969292","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Convolutional neural networks (CNNs) have been widely used in computer vision applications and have achieved great success. However, large-scale CNN models usually consume substantial computing and memory resources, which makes them difficult to deploy on embedded devices. An efficient block-floating-point (BFP) arithmetic is proposed in this paper. Compared with 32-bit floating-point arithmetic, the memory and off-chip bandwidth requirements during convolution are reduced by 50% and 72.37%, respectively. With BFP arithmetic, the complex multiplication and addition operations on floating-point numbers can be replaced by the corresponding fixed-point operations, which are more efficient in hardware. A CNN model can be deployed on our accelerator with no more than 0.14% top-1 accuracy loss, and there is no need for retraining or fine-tuning. By employing a series of ping-pong memory access schemes, 2-dimensional propagate partial multiply-accumulate (PPMAC) processors, and an optimized memory system, we implemented a CNN accelerator on a Xilinx VC709 evaluation board. The accelerator achieves a performance of 665.54 GOP/s and a power efficiency of 89.7 GOP/s/W at a 300 MHz operating frequency, which significantly outperforms previous FPGA-based accelerators.
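The abstract does not fully specify the paper's BFP format, but the core idea of block-floating-point is that a block of values shares a single exponent while each value is stored as a fixed-point mantissa, so multiply-accumulate operations inside a block reduce to integer arithmetic. A minimal sketch of generic BFP quantization (function names, bit width, and rounding policy are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def to_bfp(block, mantissa_bits=8):
    """Quantize a block of floats to block-floating-point:
    one shared exponent plus signed fixed-point mantissas."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0.0:
        return np.zeros(block.shape, dtype=np.int32), 0
    # Choose the shared exponent so the largest magnitude uses
    # nearly the full signed mantissa range.
    exp = int(np.floor(np.log2(max_abs))) - (mantissa_bits - 2)
    lo, hi = -(1 << (mantissa_bits - 1)), (1 << (mantissa_bits - 1)) - 1
    mantissas = np.clip(np.round(block / 2.0 ** exp), lo, hi).astype(np.int32)
    return mantissas, exp

def from_bfp(mantissas, exp):
    """Reconstruct approximate float values from BFP representation."""
    return mantissas.astype(np.float64) * 2.0 ** exp
```

Because every value in the block carries the same exponent, a convolution over two BFP blocks multiplies integer mantissas and accumulates in a wide integer register, deferring the exponent handling to a single shift at the end, which is what makes the fixed-point datapath hardware-friendly.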