A 50.4 GOPs/W FPGA-Based MobileNetV2 Accelerator using the Double-Layer MAC and DSP Efficiency Enhancement
Jixuan Li, Jiabao Chen, Ka-Fai Un, Wei-Han Yu, Pui-in Mak, R. Martins
2021 IEEE Asian Solid-State Circuits Conference (A-SSCC), November 7, 2021
DOI: 10.1109/A-SSCC53895.2021.9634838
Citations: 6
Abstract
Convolutional neural network (CNN) models based on depthwise separable convolution, e.g., MobileNetV2 [1] and Xception, exhibit over $40\times$ ($64\times$) reduction in the number of parameters (operations) compared to VGG16 for ImageNet inference, while maintaining a top-1 accuracy of 72%. With 8-bit quantization, the memory required to store the model can be further compressed by $4\times$. This degree of model-size compression makes it feasible to implement real-time, complex machine-learning tasks on a low-power FPGA suited to Internet-of-Things edge computation. A previous effort [2] improved computational energy efficiency by exploiting model sparsity, but its effectiveness drops on already-compressed modern CNN models. As a result, further advancing the CNN accelerator's energy efficiency with new techniques is desirable. [3] proposed a scalable adder tree for energy-efficient depthwise separable convolution, and [4] a frame-rate enhancement technique; neither addresses the extensive memory access during separable convolution, which dominates the power consumption of CNN accelerators. Herein, we propose a double-layer multiply-accumulate (MAC) scheme that evaluates two layers within a bottleneck block in a pipelined manner, significantly reducing memory access to the feature maps. On top of that, we introduce a double-operation digital signal processor (DSP) scheme that enhances the accelerator's throughput by using one high-precision DSP to compute two fixed-point operations in a single clock cycle.
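To make the parameter-reduction claim concrete, the sketch below compares per-layer parameter counts for a standard convolution against a depthwise separable one, and notes the $4\times$ storage saving from 8-bit quantization. The layer sizes chosen here are illustrative, not taken from the paper; the $40\times$/$64\times$ figures in the abstract are whole-network numbers for MobileNetV2 versus VGG16, which also differ in depth and width.

```python
def standard_conv_params(k, c_in, c_out):
    # A k x k standard convolution couples every input channel to every output channel.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise stage: one k x k filter per input channel.
    # Pointwise stage: a 1x1 convolution that mixes channels.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 256, 256          # illustrative layer shape
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(f"standard: {std}, separable: {sep}, reduction: {std / sep:.1f}x")
# -> roughly 8.7x for this single layer

# 8-bit quantization: storing int8 weights instead of 32-bit floats
# shrinks the model a further 4x, matching the abstract's figure.
print(f"quantization storage saving: {32 // 8}x")
```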
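The functional idea behind the double-layer MAC scheme can be sketched in software: fuse the depthwise and pointwise stages of a bottleneck so each depthwise result is consumed immediately by the pointwise MACs instead of being written out as a full intermediate feature map. This is a minimal NumPy model of that dataflow under assumed shapes and loop order (stride 1, "same" padding); the paper's actual pipelined hardware is not described here.

```python
import numpy as np

def fused_depthwise_pointwise(x, dw_w, pw_w):
    """x: (H, W, C) input; dw_w: (3, 3, C) depthwise weights;
    pw_w: (C, C_out) pointwise weights."""
    H, W, C = x.shape
    C_out = pw_w.shape[1]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # "same" padding
    y = np.zeros((H, W, C_out))
    for i in range(H):
        for j in range(W):
            # The depthwise result for this pixel lives only in a small
            # on-chip buffer (here, a temporary vector)...
            dw = np.einsum('klc,klc->c', xp[i:i+3, j:j+3, :], dw_w)
            # ...and is consumed immediately by the pointwise MACs, so the
            # intermediate feature map never makes a round trip to memory.
            y[i, j] = dw @ pw_w
    return y

# Usage: one bottleneck's fused pass over a small feature map.
out = fused_depthwise_pointwise(np.random.rand(8, 8, 16),
                                np.random.rand(3, 3, 16),
                                np.random.rand(16, 32))
print(out.shape)  # (8, 8, 32)
```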
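Likewise, one common way to get two fixed-point products out of a single wide multiplier is to pack two narrow operands into one word with a guard gap between them, so their partial products do not overlap. The sketch below demonstrates this packing arithmetic for unsigned 8-bit operands sharing one weight; the paper's double-operation DSP may use a different packing, and signed operands would need correction terms not shown here.

```python
SHIFT = 18  # gap wide enough that b*w (at most 255*255 < 2**18) cannot
            # spill into the bits holding a*w

def packed_double_multiply(a, b, w):
    """Compute a*w and b*w with one wide multiply (unsigned 8-bit a, b, w)."""
    packed = (a << SHIFT) | b          # two operands in one wide word
    prod = packed * w                  # single multiplication
    low = prod & ((1 << SHIFT) - 1)    # b*w sits in the low bits
    high = prod >> SHIFT               # a*w sits above the gap
    return high, low

a, b, w = 97, 205, 133
assert packed_double_multiply(a, b, w) == (a * w, b * w)
```

Because the two products come from one multiplication, a single high-precision DSP slice can serve two 8-bit MAC lanes per clock cycle, which is the throughput benefit the abstract attributes to the double-operation DSP.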