A Tri-State Weight Convolutional Neural Network for an FPGA: Applied to YOLOv2 Object Detector

Hiroki Nakahara, Masayuki Shimoda, Shimpei Sato
{"title":"A Tri-State Weight Convolutional Neural Network for an FPGA: Applied to YOLOv2 Object Detector","authors":"Hiroki Nakahara, Masayuki Shimoda, Shimpei Sato","doi":"10.1109/FPT.2018.00058","DOIUrl":null,"url":null,"abstract":"A frame object detection, such as the YOLO (You only look once), is used in embedded vision systems, such as a robot, an automobile, a security camera, and a drone. However, it requires highly performance-per-power detection by an inexpensive device. In the paper, we propose a tri-state weight CNN, which is a generalization of a low-precision and sparse (pruning) for CNN weight. In the former part, we set a weight {-1,0,+1} as a ternary CNN, while in the latter part, we set a {-w,0,+w} as a sparse weight CNN. The proposed tri-state CNN is a kind of a mixed-precision one, which is suitable for an object detector consisting of a bounding box prediction (regression) and a class estimation (classification). We apply an indirect memory access architecture to skip zero part and propose the weight parallel 2D convolutional circuit. It can efficiently be applied to the AlexNet based CNN, which has different size kernels. We design the AlexNet based YOLOv2 to reduce the number of layers toward low-latency computation. In the experiment, the proposed tri-state scheme CNN reduces the memory size for weight by 92%. We implement the proposed tri-state weight YOLOv2 on the AvNet Inc. UltraZed-EG starter kit, which has the Xilinx Inc. Zynq Ultrascale+ MPSoC ZU3EG. It archived 61.70 frames per second (FPS), which exceeds the standard video frame rate (29.97 FPS). Compared with the ARM Cortex-A57, it was 268.2 times faster, and its performance per power efficiency was 313.51 times better. Also, compared with the NVidia Pascal embedded GPU, it was 4.0 times faster, and its power performance efficiency was 11.35 times better.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 International Conference on Field-Programmable Technology (FPT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FPT.2018.00058","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Frame-based object detectors such as YOLO (You Only Look Once) are used in embedded vision systems such as robots, automobiles, security cameras, and drones. However, these systems require detection with high performance per power on an inexpensive device. In this paper, we propose a tri-state weight CNN, which generalizes low-precision and sparse (pruned) CNN weights. In the former part of the network, we set the weights to {-1, 0, +1} as in a ternary CNN, while in the latter part we set them to {-w, 0, +w} as in a sparse-weight CNN. The proposed tri-state CNN is thus a kind of mixed-precision network, well suited to an object detector that combines bounding-box prediction (regression) with class estimation (classification). We apply an indirect memory access architecture to skip the zero weights and propose a weight-parallel 2D convolution circuit, which can be applied efficiently to an AlexNet-based CNN with kernels of different sizes. We design an AlexNet-based YOLOv2 with a reduced number of layers for low-latency computation. In our experiments, the proposed tri-state scheme reduces the weight memory size by 92%. We implement the tri-state weight YOLOv2 on the Avnet UltraZed-EG starter kit, which carries a Xilinx Zynq UltraScale+ MPSoC ZU3EG. It achieved 61.70 frames per second (FPS), which exceeds the standard video frame rate (29.97 FPS). Compared with an ARM Cortex-A57, it was 268.2 times faster and its performance per power was 313.51 times better; compared with an NVIDIA Pascal embedded GPU, it was 4.0 times faster and its performance per power was 11.35 times better.
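To make the tri-state weight scheme concrete, here is a minimal software sketch of the two quantization regimes the abstract describes: ternary weights {-1, 0, +1} for the former layers and scaled sparse weights {-w, 0, +w} for the latter layers. This is an illustration, not the paper's method; the function name and the quantile-based pruning threshold are assumptions.

```python
import numpy as np

def quantize_tri_state(weights: np.ndarray, scaled: bool, sparsity: float = 0.5):
    """Quantize a weight tensor to tri-state form (illustrative sketch).

    scaled=False -> ternary weights in {-1, 0, +1} (former layers).
    scaled=True  -> sparse weights in {-w, 0, +w} (latter layers),
                    where w is a shared per-layer scale.
    The heuristic of pruning the smallest-magnitude fraction `sparsity`
    of weights is an assumption made for this example.
    """
    # Magnitudes at or below this threshold are pruned to zero.
    threshold = np.quantile(np.abs(weights).ravel(), sparsity)
    mask = np.abs(weights) > threshold        # positions that survive pruning
    signs = np.sign(weights) * mask           # entries in {-1, 0, +1}
    if not scaled:
        return signs, 1.0                     # ternary CNN weights
    # Use the mean magnitude of the kept weights as the shared scale w.
    w = np.abs(weights[mask]).mean()
    return signs * w, w                       # sparse CNN weights {-w, 0, +w}
```

Under this scheme each layer stores only a 2-bit code per weight plus at most one scale w, which is far smaller than 32-bit floating-point weights and consistent with the large weight-memory reduction the paper reports.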
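The zero-skipping idea can likewise be sketched in software. The version below is my illustration of the principle, not the paper's circuit: only the nonzero weight taps and their kernel offsets are stored, so the convolution loop never touches pruned weights. In hardware, this corresponds to the indirect memory access architecture, where a stored offset selects which input pixel to fetch for each nonzero weight.

```python
import numpy as np

def to_sparse_kernel(kernel: np.ndarray):
    """Flatten a 2D kernel into (row, col, value) triples for nonzero taps only."""
    rows, cols = np.nonzero(kernel)
    return list(zip(rows, cols, kernel[rows, cols]))

def conv2d_zero_skip(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 2D convolution (CNN-style correlation) visiting only nonzero weights."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r, c, w in to_sparse_kernel(kernel):      # zero weights are skipped entirely
        out += w * image[r:r + oh, c:c + ow]      # one shifted multiply-accumulate per tap
    return out
```

With tri-state weights, most kernel entries are zero, so the loop body runs only for the few surviving taps; the paper's weight-parallel 2D convolution circuit processes those taps in parallel and handles the different kernel sizes of the AlexNet-based network.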