A 50.4 GOPs/W FPGA-Based MobileNetV2 Accelerator using the Double-Layer MAC and DSP Efficiency Enhancement
Jixuan Li, Jiabao Chen, Ka-Fai Un, Wei-Han Yu, Pui-in Mak, R. Martins
2021 IEEE Asian Solid-State Circuits Conference (A-SSCC), Nov. 2021. DOI: 10.1109/A-SSCC53895.2021.9634838
Convolutional neural network (CNN) models based on depthwise separable convolution, e.g. MobileNetV2 [1] and Xception, achieve over a 40× (64×) reduction in the number of parameters (operations) compared to VGG16 for ImageNet inference, while maintaining 72% TOP-1 accuracy. With 8-bit quantization, the memory required to store the model shrinks by a further 4×. This degree of model compression makes real-time, complex machine-learning tasks practical on a low-power FPGA suited to Internet-of-Things edge computation. Previous work [2] improved computational energy efficiency by exploiting model sparsity, but the benefit diminishes on already-compressed modern CNN models, so new techniques are needed to push CNN accelerator energy efficiency further. A scalable adder tree for energy-efficient depthwise separable convolution was proposed in [3], and a frame-rate enhancement technique in [4]; neither addresses the extensive memory access during separable convolution, which dominates the power consumption of CNN accelerators. Herein we propose a double-layer multiply-accumulate (MAC) scheme that evaluates two layers of a bottleneck block in a pipelined manner, significantly reducing memory accesses to the feature maps. On top of that, we introduce a double-operation digital signal processor (DSP) technique that raises the accelerator's throughput by using one high-precision DSP slice to compute two fixed-point operations per clock cycle.
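To make the parameter-reduction arithmetic concrete, the sketch below compares the weight count of a standard convolution with its depthwise separable counterpart. The layer shape (k = 3, 256 → 256 channels) is an assumed example, not a figure from the paper.

```python
# Weight counts: standard vs. depthwise separable convolution.

def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """k x k standard convolution: every filter spans all input channels."""
    return k * k * c_in * c_out

def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Depthwise k x k conv (one filter per channel) plus pointwise 1 x 1 conv."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 256, 256  # assumed example layer
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(f"standard: {std}, separable: {sep}, per-layer saving: {std / sep:.1f}x")
# ~8.7x per layer for k = 3 (the ratio tends to 1 / (1/c_out + 1/k^2));
# the >40x whole-model figure versus VGG16 also reflects VGG16's very large
# fully connected layers.

# 8-bit quantization: 32-bit floating-point weights stored as 8-bit integers
# shrink model storage by a further 32 / 8 = 4x.
```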
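The double-layer MAC scheme can be read as a form of layer fusion: the second layer of the bottleneck block starts consuming expanded rows as soon as the first layer produces them, so the intermediate feature map never has to travel to external memory. The NumPy sketch below is a functional model of that schedule under assumed shapes (a 1×1 expansion followed by a 3×3 depthwise convolution); it illustrates the dataflow, not the paper's actual circuit.

```python
import numpy as np

def fused_bottleneck_rows(x, w1, w2):
    """Pipelined evaluation of a 1x1 expansion (w1) then a 3x3 depthwise conv (w2).

    x: (H, W, C) input, w1: (C, E) pointwise weights, w2: (3, 3, E) depthwise
    weights. The depthwise stage fires as soon as the three expanded rows its
    window needs exist, so a hardware version only buffers ~3 rows on chip;
    the full `pad` array here is for clarity only.
    """
    H, W, _ = x.shape
    E = w1.shape[1]
    pad = np.zeros((H + 2, W + 2, E), np.float32)  # zero-padded row buffer
    out = np.zeros((H, W, E), np.float32)

    def depthwise_row(r):  # slide the 3x3 window across output row r
        for c in range(W):
            out[r, c] = np.einsum("ijk,ijk->k", pad[r:r + 3, c:c + 3], w2)

    for r in range(H):
        pad[r + 1, 1:-1] = x[r] @ w1   # layer 1 emits one expanded row
        if r >= 1:
            depthwise_row(r - 1)       # layer 2 consumes it immediately
    depthwise_row(H - 1)               # drain the pipeline
    return out

# Sanity check against the unfused two-pass computation:
x = np.random.randn(8, 8, 16).astype(np.float32)
w1 = np.random.randn(16, 64).astype(np.float32)
w2 = np.random.randn(3, 3, 64).astype(np.float32)
full = np.zeros((10, 10, 64), np.float32)
full[1:-1, 1:-1] = x @ w1
ref = np.array([[np.einsum("ijk,ijk->k", full[r:r + 3, c:c + 3], w2)
                 for c in range(8)] for r in range(8)])
assert np.allclose(fused_bottleneck_rows(x, w1, w2), ref, atol=1e-3)
```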
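The double-operation DSP follows a known pattern: two narrow fixed-point multiplications can share one wide hardware multiplier if the two multiplicands are packed into disjoint bit fields of a single operand. The sketch below shows the unsigned 8-bit case; the field width is an assumption, and the paper's actual packing (including any sign-correction logic for signed operands) is not reproduced here.

```python
# One wide multiply evaluating two 8-bit multiplications, in the spirit of
# packing two fixed-point operations into a single high-precision DSP slice.
# Unsigned operands are assumed; signed INT8 additionally requires correcting
# the upper product whenever the lower product is negative.

FIELD = 16  # bits per product field; an 8-bit x 8-bit product fits in 16 bits

def dual_multiply(a: int, b: int, x: int) -> tuple[int, int]:
    """Return (a * x, b * x) using a single wide multiplication."""
    assert all(0 <= v < 256 for v in (a, b, x))
    packed = (a << FIELD) + b        # two weights in one multiplier operand
    product = packed * x             # the one hardware multiply per cycle
    return product >> FIELD, product & ((1 << FIELD) - 1)

assert dual_multiply(37, 201, 113) == (37 * 113, 201 * 113)
```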