A Block-Floating-Point Arithmetic Based FPGA Accelerator for Convolutional Neural Networks

H. Zhang, Zhenyu Liu, Guanwen Zhang, Jiwu Dai, Xiaocong Lian, W. Zhou, Xiangyang Ji
{"title":"A Block-Floating-Point Arithmetic Based FPGA Accelerator for Convolutional Neural Networks","authors":"H. Zhang, Zhenyu Liu, Guanwen Zhang, Jiwu Dai, Xiaocong Lian, W. Zhou, Xiangyang Ji","doi":"10.1109/GlobalSIP45357.2019.8969292","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNNs) have been widely used in computer vision applications and achieved great success. However, large-scale CNN models usually consume a lot of computing and memory resources, which makes it difficult for them to be deployed on embedded devices. An efficient block-floating-point (BFP) arithmetic is proposed in this paper. compared with 32-bit floating-point arithmetic, the memory and off-chip bandwidth requirements during convolution are reduced by 50% and 72.37%, respectively. Due to the adoption of BFP arithmetic, the complex multiplication and addition operations of floating-point numbers can be replaced by the corresponding operations of fixed-point numbers, which is more efficient on hardware. A CNN model can be deployed on our accelerator with no more than 0.14% top-1 accuracy loss, and there is no need for retraining and fine-tuning. By employing a series of ping-pong memory access schemes, 2-dimensional propagate partial multiply-accumulate (PPMAC) processors, and an optimized memory system, we implemented a CNN accelerator on Xilinx VC709 evaluation board. 
The accelerator achieves a performance of 665.54 GOP/s and a power efficiency of 89.7 GOP/s/W under a 300 MHz working frequency, which outperforms previous FPGA based accelerators significantly.","PeriodicalId":221378,"journal":{"name":"2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP)","volume":"254 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/GlobalSIP45357.2019.8969292","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Convolutional neural networks (CNNs) have been widely used in computer vision applications and have achieved great success. However, large-scale CNN models usually consume substantial computing and memory resources, which makes them difficult to deploy on embedded devices. An efficient block-floating-point (BFP) arithmetic is proposed in this paper. Compared with 32-bit floating-point arithmetic, the memory and off-chip bandwidth requirements during convolution are reduced by 50% and 72.37%, respectively. With BFP arithmetic, the complex multiplication and addition operations on floating-point numbers can be replaced by the corresponding fixed-point operations, which are more efficient in hardware. A CNN model can be deployed on our accelerator with no more than 0.14% top-1 accuracy loss, and there is no need for retraining or fine-tuning. By employing a series of ping-pong memory access schemes, 2-dimensional propagate partial multiply-accumulate (PPMAC) processors, and an optimized memory system, we implemented a CNN accelerator on a Xilinx VC709 evaluation board. The accelerator achieves a performance of 665.54 GOP/s and a power efficiency of 89.7 GOP/s/W at a 300 MHz operating frequency, which significantly outperforms previous FPGA-based accelerators.
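The abstract does not fully specify the paper's BFP format, but the core idea of block-floating-point is that a block of values shares a single exponent while each value is stored as a fixed-point mantissa, so multiply-accumulate operations inside a block reduce to integer arithmetic. A minimal sketch of generic BFP quantization (function names, bit width, and rounding policy are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def to_bfp(block, mantissa_bits=8):
    """Quantize a block of floats to block-floating-point:
    one shared exponent plus signed fixed-point mantissas."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0.0:
        return np.zeros(block.shape, dtype=np.int32), 0
    # Choose the shared exponent so the largest magnitude uses
    # nearly the full signed mantissa range.
    exp = int(np.floor(np.log2(max_abs))) - (mantissa_bits - 2)
    lo, hi = -(1 << (mantissa_bits - 1)), (1 << (mantissa_bits - 1)) - 1
    mantissas = np.clip(np.round(block / 2.0 ** exp), lo, hi).astype(np.int32)
    return mantissas, exp

def from_bfp(mantissas, exp):
    """Reconstruct approximate float values from BFP representation."""
    return mantissas.astype(np.float64) * 2.0 ** exp
```

Because every value in the block carries the same exponent, a convolution over two BFP blocks multiplies integer mantissas and accumulates in a wide integer register, deferring the exponent handling to a single shift at the end, which is what makes the fixed-point datapath hardware-friendly.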