A 50.4 GOPs/W FPGA-Based MobileNetV2 Accelerator using the Double-Layer MAC and DSP Efficiency Enhancement

Jixuan Li, Jiabao Chen, Ka-Fai Un, Wei-Han Yu, Pui-in Mak, R. Martins
{"title":"基于50.4 GOPs/W fpga的双层MAC和DSP效率增强的MobileNetV2加速器","authors":"Jixuan Li, Jiabao Chen, Ka-Fai Un, Wei-Han Yu, Pui-in Mak, R. Martins","doi":"10.1109/A-SSCC53895.2021.9634838","DOIUrl":null,"url":null,"abstract":"Convolutional neural network (CNN) models, e.g. MobileNetV2 [1] and Xception, are based on depthwise separable convolution. They exhibit over $40 \\times(64 \\times)$ reduction of the number of parameters (operations) when compared to the VGG16 for the ImageNet inference, while maintaining the TOP-1 accuracy at 72 %. With an 8-bit quantization, the required memory for storing the model can be further compressed by $4 \\times$. This multitude of model sizes compression facilitates real-time complex machine learning tasks implemented on a low-power FPGA apt for Internet-of-Things edge computation. Previous effect [2] has improved its computational energy efficiency by exploiting the model sparsity, but the effectiveness drops in already-compressed modern CNN models. As a result, further advancing the CNN accelerator’s energy efficiency with new techniques is desirable. [3] is a scalable adder tree for energy-efficient depthwise separable convolution computation, and [4] is a frame-rate enhancement technique; both failed to handle the extensive memory access during separable convolution that dominates the power consumption of the CNN accelerators. Herein we propose a double-layer multiply-accumulate (MAC) scheme to evaluate two layers within the bottleneck layer in a pipelining manner. It results significant reduction of the memory access to the feature maps. On top of that we also innovate a double-operation digital signal processor (DSP) to enhance the throughput of the accelerator by benefiting the use of a high-precision DSP for computing two fixed-point operations in one clock cycle.","PeriodicalId":286139,"journal":{"name":"2021 IEEE Asian Solid-State Circuits Conference (A-SSCC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"A 50.4 GOPs/W FPGA-Based MobileNetV2 Accelerator using the Double-Layer MAC and DSP Efficiency Enhancement\",\"authors\":\"Jixuan Li, Jiabao Chen, Ka-Fai Un, Wei-Han Yu, Pui-in Mak, R. Martins\",\"doi\":\"10.1109/A-SSCC53895.2021.9634838\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Convolutional neural network (CNN) models, e.g. MobileNetV2 [1] and Xception, are based on depthwise separable convolution. They exhibit over $40 \\\\times(64 \\\\times)$ reduction of the number of parameters (operations) when compared to the VGG16 for the ImageNet inference, while maintaining the TOP-1 accuracy at 72 %. With an 8-bit quantization, the required memory for storing the model can be further compressed by $4 \\\\times$. This multitude of model sizes compression facilitates real-time complex machine learning tasks implemented on a low-power FPGA apt for Internet-of-Things edge computation. Previous effect [2] has improved its computational energy efficiency by exploiting the model sparsity, but the effectiveness drops in already-compressed modern CNN models. As a result, further advancing the CNN accelerator’s energy efficiency with new techniques is desirable. 
[3] is a scalable adder tree for energy-efficient depthwise separable convolution computation, and [4] is a frame-rate enhancement technique; both failed to handle the extensive memory access during separable convolution that dominates the power consumption of the CNN accelerators. Herein we propose a double-layer multiply-accumulate (MAC) scheme to evaluate two layers within the bottleneck layer in a pipelining manner. It results significant reduction of the memory access to the feature maps. On top of that we also innovate a double-operation digital signal processor (DSP) to enhance the throughput of the accelerator by benefiting the use of a high-precision DSP for computing two fixed-point operations in one clock cycle.\",\"PeriodicalId\":286139,\"journal\":{\"name\":\"2021 IEEE Asian Solid-State Circuits Conference (A-SSCC)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE Asian Solid-State Circuits Conference (A-SSCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/A-SSCC53895.2021.9634838\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Asian Solid-State Circuits Conference (A-SSCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/A-SSCC53895.2021.9634838","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Convolutional neural network (CNN) models built on depthwise separable convolution, e.g. MobileNetV2 [1] and Xception, exhibit over a $40\times$ ($64\times$) reduction in the number of parameters (operations) compared with VGG16 for ImageNet inference, while maintaining 72% TOP-1 accuracy. With 8-bit quantization, the memory required to store the model can be compressed by a further $4\times$. This degree of model-size compression enables real-time, complex machine-learning tasks on a low-power FPGA suited to Internet-of-Things edge computation. A previous effort [2] improved computational energy efficiency by exploiting model sparsity, but its effectiveness drops on already-compressed modern CNN models, so further advancing the energy efficiency of CNN accelerators with new techniques is desirable. [3] proposed a scalable adder tree for energy-efficient depthwise separable convolution, and [4] a frame-rate enhancement technique; neither addresses the extensive memory access during separable convolution that dominates the power consumption of CNN accelerators. Herein we propose a double-layer multiply-accumulate (MAC) scheme that evaluates two layers within the bottleneck block in a pipelined manner, significantly reducing memory accesses to the feature maps. On top of that, we introduce a double-operation digital signal processor (DSP) technique that enhances the accelerator's throughput by using one high-precision DSP to compute two fixed-point operations in a single clock cycle.
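
For context (this derivation follows the standard MobileNet cost model, not anything specific to this paper): a standard convolution with a $D_K \times D_K$ kernel, $M$ input channels, $N$ output channels and a $D_F \times D_F$ output requires $D_K^2 \cdot M \cdot N \cdot D_F^2$ multiply-accumulates. Factoring it into a depthwise stage plus a pointwise stage costs $D_K^2 \cdot M \cdot D_F^2 + M \cdot N \cdot D_F^2$, i.e. a fraction $\frac{1}{N} + \frac{1}{D_K^2}$ of the original. With $D_K = 3$ this is roughly an $8$-$9\times$ per-layer saving, which, combined with the narrower bottleneck layers of MobileNetV2, compounds into network-level reductions of the order quoted above.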
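The key idea behind the double-layer MAC, pipelining two layers of a bottleneck so the expanded intermediate feature map never leaves the chip, can be illustrated with a minimal software sketch. The function below is a hypothetical model (the function name, the row-granular schedule, and the 3-row buffer depth are our assumptions, not the paper's datapath): it fuses a 1x1 expansion with a 3x3 depthwise convolution, buffering only three rows of the expanded map.

```python
import numpy as np

def fused_expand_depthwise(x, w_pw, w_dw):
    """Sketch of a double-layer pipeline: a 1x1 (pointwise) expansion
    feeds a 3x3 depthwise conv row by row, so only a 3-row slice of the
    expanded feature map is ever buffered (hypothetical model, not the
    authors' exact hardware schedule).

    x:    (H, W, C_in)   input feature map
    w_pw: (C_in, C_exp)  1x1 expansion weights
    w_dw: (3, 3, C_exp)  depthwise weights
    returns (H-2, W-2, C_exp) -- 'valid' padding for brevity
    """
    H, W, _ = x.shape
    C_exp = w_pw.shape[1]
    rows = []                                # rolling 3-row buffer
    out = np.empty((H - 2, W - 2, C_exp))
    for h in range(H):
        # stage 1: expand one input row (1x1 conv = matrix product)
        rows.append(x[h] @ w_pw)             # (W, C_exp)
        if len(rows) > 3:
            rows.pop(0)                      # row no longer needed
        if len(rows) == 3:
            # stage 2: depthwise 3x3 over the three buffered rows
            win = np.stack(rows)             # (3, W, C_exp)
            for c in range(W - 2):
                patch = win[:, c:c + 3]      # (3, 3, C_exp)
                out[h - 2, c] = (patch * w_dw).sum(axis=(0, 1))
    return out
```

The result is identical to running the two layers back to back, but the peak intermediate storage drops from $H \cdot W \cdot C_{exp}$ (the largest tensor in a bottleneck block) to $3 \cdot W \cdot C_{exp}$, which is why pipelining the two layers cuts feature-map memory traffic.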
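The double-operation DSP is described only at a high level in the abstract. One widely used way to obtain two 8-bit multiplies per cycle from a single wide multiplier (e.g. a 27x18-bit FPGA DSP slice) is operand packing, sketched below; this is an illustrative scheme under that assumption, not necessarily the authors' exact implementation.

```python
def packed_dual_mac(w0, w1, a):
    """Compute w0*a and w1*a with ONE wide multiply by packing the two
    signed 8-bit weights 18 bits apart (a common bit-level model of a
    'double-operation' DSP; a real DSP slice performs the wide multiply
    in hardware in one clock cycle)."""
    assert all(-128 <= v <= 127 for v in (w0, w1, a))
    packed = (w1 << 18) + w0         # fits a 27-bit signed operand
    p = packed * a                   # the single hardware multiply
    lo = p & ((1 << 18) - 1)         # lower 18-bit field
    if lo >= (1 << 17):              # sign-extend: |w0*a| < 2^17
        lo -= 1 << 18
    hi = (p - lo) >> 18              # upper field, borrow removed
    return lo, hi                    # (w0*a, w1*a)
```

For example, `packed_dual_mac(-5, 3, -7)` returns `(35, -21)`, i.e. both products recovered from one multiplication; the 18-bit field spacing leaves enough headroom that the two signed partial products never overlap.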