Bo Liu;Xinxiang Huang;Yang Zhang;Guang Yang;Han Yan;Chen Zhang;Zejv Li;Yuanhao Wang;Hao Cai
{"title":"Layer-Wise Mixed-Modes CNN Processing Architecture With Double-Stationary Dataflow and Dimension-Reshape Strategy","authors":"Bo Liu;Xinxiang Huang;Yang Zhang;Guang Yang;Han Yan;Chen Zhang;Zejv Li;Yuanhao Wang;Hao Cai","doi":"10.1109/TCSI.2024.3434706","DOIUrl":null,"url":null,"abstract":"With the development of convolutional neural networks (CNN) across various domains, the growth in network structure complexity and computational load has increasingly become a research focus in the deployment of neural networks. The key to current research on neural network accelerators lies in striking a balance between computational accuracy and energy efficiency. This paper proposes a software-hardware co-design to strike the balance for CNN edge applications. On the hardware side, a 3-dimensional tensor engine (3D-TE), achieved with reconfigurable Tensor Processing Units (TPUs), is introduced for efficient convolution computation. We optimize the CNN dataflow on 3D-TE using a dimension reshaping method for feature maps rearrangement, and a double stationary dataflow scheduling to reduce memory access. This paper adopts a configurable approximate multiplier design based on Boolean Matrix Factorization (BMF) based logic synthesis applied in the architecture of TPU. The proposed 3D-TE, characterized by its configurable precision, enables the TPUs to dynamically adapt the bitwidth of features and weights in response to varying precision requirements. On the software side, a hessian-guided layer precision mapping is adopted to reduce unnecessary computational overhead, and a progressive re-training approach is proposed to enable a better approximation configuration and higher power reduction. Fabricated on 28-nm CMOS, this work achieves an optimized energy efficiency of 14.9 TOPS/W and 12.1 TOPS/W for ResNet56 and MobileNetV2 respectively, with 0.6V supply voltage and 150MHz clock frequency, representing an improvement of \n<inline-formula> <tex-math>$1.33\\times \\sim 8.28\\times $ </tex-math></inline-formula>\n over the state-of-the-art works.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":null,"pages":null},"PeriodicalIF":5.2000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10620270/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
Abstract
With the development of convolutional neural networks (CNN) across various domains, the growth in network structure complexity and computational load has increasingly become a research focus in the deployment of neural networks. The key to current research on neural network accelerators lies in striking a balance between computational accuracy and energy efficiency. This paper proposes a software-hardware co-design to strike the balance for CNN edge applications. On the hardware side, a 3-dimensional tensor engine (3D-TE), achieved with reconfigurable Tensor Processing Units (TPUs), is introduced for efficient convolution computation. We optimize the CNN dataflow on 3D-TE using a dimension reshaping method for feature maps rearrangement, and a double stationary dataflow scheduling to reduce memory access. This paper adopts a configurable approximate multiplier design based on Boolean Matrix Factorization (BMF) based logic synthesis applied in the architecture of TPU. The proposed 3D-TE, characterized by its configurable precision, enables the TPUs to dynamically adapt the bitwidth of features and weights in response to varying precision requirements. On the software side, a hessian-guided layer precision mapping is adopted to reduce unnecessary computational overhead, and a progressive re-training approach is proposed to enable a better approximation configuration and higher power reduction. Fabricated on 28-nm CMOS, this work achieves an optimized energy efficiency of 14.9 TOPS/W and 12.1 TOPS/W for ResNet56 and MobileNetV2 respectively, with 0.6V supply voltage and 150MHz clock frequency, representing an improvement of
$1.33\times \sim 8.28\times $
over the state-of-the-art works.
期刊介绍:
TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes: - Circuits: Analog, Digital and Mixed Signal Circuits and Systems - Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic - Circuits and Systems, Power Electronics and Systems - Software for Analog-and-Logic Circuits and Systems - Control aspects of Circuits and Systems.