A Low-Bit Quantized and HLS-Based Neural Network FPGA Accelerator for Object Detection

Jiaming Huang, Junyan Yang, Saisai Nui, Hang Yi, Wei Wang, Hai-Bao Chen

2021 China Semiconductor Technology International Conference (CSTIC), March 14, 2021
DOI: 10.1109/CSTIC52283.2021.9461256
Citations: 3
Abstract
In this paper, an HLS-based convolutional neural network (CNN) accelerator is designed for FPGA, and channel-wise low-bit quantization is applied to YOLOv3-Tiny: weights are quantized to 2 bits and activations to 8 bits. The quantization range is learnable during training to prevent severe accuracy loss. The accelerator uses a sliding-window technique to improve data reuse, and an efficient processing element (PE) is designed to exploit low-bit arithmetic. The design makes full use of DSP and LUT resources and exploits the optimal parallelism available on an embedded FPGA. It reaches 90.6 GOP/s on a PYNQ-Z2 at 150 MHz, outperforming other accelerators implemented on the same platform in terms of peak performance and power efficiency.
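The channel-wise scheme described above can be illustrated with a minimal sketch. This is a hypothetical helper, not the paper's implementation: it performs symmetric per-output-channel weight quantization with a clipping range `alpha`, which in the paper is a learnable parameter updated during training (here it simply defaults to the per-channel absolute maximum).

```python
import numpy as np

def quantize_per_channel(w, n_bits=2, alpha=None):
    """Channel-wise symmetric quantization sketch (illustrative only).

    w      : weight tensor with shape (out_channels, ...)
    n_bits : bit width; 2 gives signed levels {-2, -1, 0, 1}
    alpha  : per-channel clipping range; learnable in training per the paper
    """
    qmax = 2 ** (n_bits - 1) - 1                 # largest positive code, e.g. 1 for 2-bit
    flat = w.reshape(w.shape[0], -1)             # one row per output channel
    if alpha is None:
        alpha = np.abs(flat).max(axis=1)         # stand-in for the learned range
    scale = alpha / max(qmax, 1)                 # one scale factor per channel
    codes = np.clip(np.round(flat / scale[:, None]), -qmax - 1, qmax)
    dequant = (codes * scale[:, None]).reshape(w.shape)
    return dequant, codes.reshape(w.shape).astype(np.int8)
```

A 2-bit signed weight thus takes one of four values per channel, which is what lets the PE replace multiplications with cheap shift/add logic on the FPGA; the same routine with `n_bits=8` covers the activation path.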