An Energy-Efficient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks

L. Dai, Quanbao Cheng, Yuhang Wang, Gengbin Huang, Junzhuo Zhou, Kai Li, Wei Mao, Hao Yu
{"title":"An Energy-Efficient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks","authors":"L. Dai, Quanbao Cheng, Yuhang Wang, Gengbin Huang, Junzhuo Zhou, Kai Li, Wei Mao, Hao Yu","doi":"10.1109/asp-dac52403.2022.9712509","DOIUrl":null,"url":null,"abstract":"Optimized convolutional neural network (CNN) models and energy-efficient hardware design are of great importance in edge-computing applications. The neural architecture search (NAS) methods are employed for CNN model optimization with multi-precision networks. To satisfy the computation requirements, multi-precision convolution accelerators are highly desired. The existing high-precision-split (HPS) designs reduce the additional logics for reconfiguration while resulting in low throughput for low precisions. The low-precision-combination (LPC) designs improve the low-precision throughput with large hardware cost. In this work, a bit-split-and-combination (BSC) systolic accelerator is proposed to overcome the bottlenecks. Firstly, BSC-based multiply-accumulate (MAC) unit is designed to support multi-precision computation operations. Secondly, multi-precision systolic dataflow is developed with improved data-reuse and transmission efficiency. The proposed work is designed by Chisel and synthesized in 28-nm process. The BSC MAC unit achieves maximum 2.40× and 1.64× energy efficiency than HPS and LPC units, respectively. Compared with published accelerator designs Gemmini, Bit-fusion and Bit-serial, the proposed accelerator achieves up to 2.94 × area efficiency and 6.38 × energy-saving performance on the multi-precision VGG-16, ResNet-18 and LeNet-5 benchmarks.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/asp-dac52403.2022.9712509","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Optimized convolutional neural network (CNN) models and energy-efficient hardware design are of great importance in edge-computing applications. The neural architecture search (NAS) methods are employed for CNN model optimization with multi-precision networks. To satisfy the computation requirements, multi-precision convolution accelerators are highly desired. The existing high-precision-split (HPS) designs reduce the additional logics for reconfiguration while resulting in low throughput for low precisions. The low-precision-combination (LPC) designs improve the low-precision throughput with large hardware cost. In this work, a bit-split-and-combination (BSC) systolic accelerator is proposed to overcome the bottlenecks. Firstly, BSC-based multiply-accumulate (MAC) unit is designed to support multi-precision computation operations. Secondly, multi-precision systolic dataflow is developed with improved data-reuse and transmission efficiency. The proposed work is designed by Chisel and synthesized in 28-nm process. The BSC MAC unit achieves maximum 2.40× and 1.64× energy efficiency than HPS and LPC units, respectively. Compared with published accelerator designs Gemmini, Bit-fusion and Bit-serial, the proposed accelerator achieves up to 2.94 × area efficiency and 6.38 × energy-saving performance on the multi-precision VGG-16, ResNet-18 and LeNet-5 benchmarks.
基于nas的多精度卷积神经网络的高效位分割组合收缩加速器
优化卷积神经网络(CNN)模型和节能硬件设计在边缘计算应用中具有重要意义。将神经结构搜索(NAS)方法应用于多精度网络的CNN模型优化。为了满足计算要求,需要多精度的卷积加速器。现有的高精度分割(HPS)设计减少了重新配置的额外逻辑,同时导致低精度的低吞吐量。低精度组合(LPC)设计以较高的硬件成本提高了低精度吞吐量。在这项工作中,提出了一种比特分裂和组合(BSC)收缩加速器来克服瓶颈。首先,设计了基于bsc的多重累积(MAC)单元,以支持多精度的计算操作;其次,开发了多精度的收缩数据流,提高了数据复用和传输效率。该作品由Chisel设计,并在28nm工艺下合成。与HPS和LPC相比,BSC MAC单元的最大能效分别为2.40倍和1.64倍。在多精度VGG-16、ResNet-18和LeNet-5基准测试上,与已有的gemini、Bit-fusion和Bit-serial加速器设计相比,该加速器的面积效率高达2.94倍,节能性能为6.38倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信