A 28nm 8-bit Floating-Point Tensor Core based CNN Training Processor with Dynamic Activation/Weight Sparsification

ESSCIRC 2022 - IEEE 48th European Solid State Circuits Conference (ESSCIRC). Pub Date: 2022-09-19. DOI: 10.1109/ESSCIRC55480.2022.9911359

S. Venkataramanaiah, Jian Meng, Han-Sok Suh, Injune Yeo, Jyotishman Saikia, Sai Kiran Cherupally, Yichi Zhang, Zhiru Zhang, J.-s. Seo

Citations: 1

Abstract

We present an 8-bit floating-point (FP8) training processor which implements (1) highly parallel tensor cores (fused multiply-add trees) that maintain high utilization throughout forward propagation (FP), backward propagation (BP), and weight update (WU) phases of the training process, (2) hardware-efficient channel gating for dynamic output activation sparsity, (3) dynamic weight sparsity based on group Lasso, and (4) gradient skipping based on FP prediction error. We develop a custom ISA to flexibly support different CNN topologies and training parameters. The 28nm prototype chip demonstrates large improvements in FLOPs reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×), for both supervised and self-supervised training tasks.
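The abstract names group Lasso as the basis for dynamic weight sparsity but does not say how weights are grouped or how the penalty is applied on chip. As a rough illustration only, the sketch below shows the generic technique in plain Python: a penalty equal to the sum of per-group L2 norms, and the matching proximal (group soft-thresholding) step that zeroes whole groups at once. The function names, the grouping of weights into short lists, and the threshold value are assumptions made for this example and are not taken from the paper.

```python
import math


def group_lasso_penalty(groups, lam):
    """Return lam * (sum of L2 norms of the weight groups)."""
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in groups)


def group_soft_threshold(groups, threshold):
    """Generic proximal step for the group Lasso penalty.

    Each group's L2 norm is shrunk by `threshold`; groups whose norm is
    at or below `threshold` are set to zero as a whole. This is a textbook
    formulation, not the processor's actual update rule.
    """
    out = []
    for g in groups:
        norm = math.sqrt(sum(w * w for w in g))
        if norm <= threshold:
            out.append([0.0] * len(g))      # whole group pruned
        else:
            scale = 1.0 - threshold / norm  # shrink the surviving group
            out.append([w * scale for w in g])
    return out


def group_sparsity(groups):
    """Fraction of groups that are entirely zero."""
    zero_groups = sum(1 for g in groups if all(w == 0.0 for w in g))
    return zero_groups / len(groups)


# Hypothetical weights: three groups (e.g. three channels) of two weights each.
weights = [[3.0, 4.0], [0.3, 0.4], [0.0, 1.0]]
pruned = group_soft_threshold(weights, threshold=1.0)
```

Because entire groups become zero together, the resulting sparsity is structured, which is what lets hardware skip whole blocks of multiply-adds rather than individual weights.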

