Going Deeper with Embedded FPGA Platform for Convolutional Neural Network

Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, Huazhong Yang
{"title":"Going Deeper with Embedded FPGA Platform for Convolutional Neural Network","authors":"Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, Huazhong Yang","doi":"10.1145/2847263.2847265","DOIUrl":null,"url":null,"abstract":"In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are com-putational-intensive and resource-consuming, and thus are hard to be integrated into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNN, but the limited bandwidth and on-chip memory size limit the performance of FPGA accelerator for CNN. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for Image-Net large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that Convolutional layers are computational-centric and Fully-Connected layers are memory-centric. Then the dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNN are proposed to improve the bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure a high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on Xilinx Zynq ZC706 board achieves a frame rate at 4.45 fps with the top-5 accuracy of 86.66% using 16-bit quantization. The average performance of convolutional layers and the full CNN is 187.8 GOP/s and 137.0 GOP/s under 150MHz working frequency, which outperform previous approaches significantly.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"85 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1051","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2847263.2847265","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1051

Abstract

In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have become among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computationally intensive and resource-consuming, and are therefore hard to integrate into embedded systems such as smart phones, smart glasses, and robots. The FPGA is one of the most promising platforms for accelerating CNNs, but limited bandwidth and on-chip memory size constrain the performance of FPGA accelerators for CNN. In this paper, we go deeper with the embedded FPGA platform for accelerating CNNs and propose a CNN accelerator design on an embedded FPGA for ImageNet large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that convolutional layers are computation-centric while fully-connected layers are memory-centric. We then propose a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in a CNN to improve bandwidth and resource utilization. Results show that our data quantization flow introduces only 0.4% accuracy loss for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is further proposed to ensure high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network implemented end-to-end on an FPGA so far. The system on the Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of the convolutional layers and the full CNN is 187.8 GOP/s and 137.0 GOP/s, respectively, at a 150 MHz working frequency, which significantly outperforms previous approaches.
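As a rough illustration of the dynamic-precision quantization idea mentioned in the abstract, the sketch below (plain NumPy, not the authors' FPGA implementation; the function names, candidate range, and example weights are assumptions) picks, per layer, the fixed-point fractional length that minimizes quantization error for a given total bit width.

```python
import numpy as np

def quantize(x, bit_width, frac_len):
    """Round x onto a fixed-point grid with `bit_width` total bits and
    `frac_len` fractional bits (two's-complement range)."""
    step = 2.0 ** (-frac_len)
    q_min = -(2 ** (bit_width - 1)) * step
    q_max = (2 ** (bit_width - 1) - 1) * step
    return np.clip(np.round(x / step) * step, q_min, q_max)

def best_frac_len(x, bit_width, candidates=range(-8, 16)):
    """Choose the fractional length giving the smallest total quantization error."""
    errors = [np.abs(quantize(x, bit_width, fl) - x).sum() for fl in candidates]
    return list(candidates)[int(np.argmin(errors))]

# Hypothetical usage: select an 8-bit format separately for each layer's weights,
# so layers with different dynamic ranges get different radix points.
rng = np.random.default_rng(0)
weights = {"conv1": rng.normal(0, 0.05, 1000), "fc6": rng.normal(0, 0.01, 1000)}
for name, w in weights.items():
    print(name, "fractional bits:", best_frac_len(w, bit_width=8))
```

Similarly, the "SVD" in VGG16-SVD refers to low-rank factorization of the memory-centric fully-connected layers. A minimal sketch of that idea, again with hypothetical names and matrix sizes, is:

```python
import numpy as np

def svd_compress(W, rank):
    """Approximate an (out x in) fully-connected weight matrix by two low-rank
    factors, cutting stored weights from out*in to rank*(out + in)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape (out, rank)
    B = Vt[:rank, :]             # shape (rank, in)
    return A, B

W = np.random.default_rng(1).normal(size=(1024, 1024)).astype(np.float32)
A, B = svd_compress(W, rank=128)
x = np.random.default_rng(2).normal(size=1024).astype(np.float32)
y_full = W @ x
y_lowrank = A @ (B @ x)          # two smaller matrix-vector products
```

Because fully-connected layers are bandwidth-bound rather than compute-bound, shrinking the weight matrix that must be fetched from external memory is what makes the end-to-end VGG16-SVD deployment feasible on the ZC706's limited bandwidth.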