A 28nm 8-bit Floating-Point Tensor Core based CNN Training Processor with Dynamic Activation/Weight Sparsification

ESSCIRC 2022 - IEEE 48th European Solid State Circuits Conference (ESSCIRC) | Pub Date: 2022-09-19 | DOI: 10.1109/ESSCIRC55480.2022.9911359

S. Venkataramanaiah, Jian Meng, Han-Sok Suh, Injune Yeo, Jyotishman Saikia, Sai Kiran Cherupally, Yichi Zhang, Zhiru Zhang, J.-s. Seo


Citations: 1

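Abstract

We present an 8-bit floating-point (FP8) training processor which implements (1) highly parallel tensor cores (fused multiply-add trees) that maintain high utilization throughout the forward propagation (FP), backward propagation (BP), and weight update (WU) phases of the training process, (2) hardware-efficient channel gating for dynamic output activation sparsity, (3) dynamic weight sparsity based on group Lasso, and (4) gradient skipping based on FP prediction error. We develop a custom ISA to flexibly support different CNN topologies and training parameters. The 28nm prototype chip demonstrates large improvements in FLOPs reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×), for both supervised and self-supervised training tasks.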
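The abstract does not spell out how the group Lasso based weight sparsity is formulated, so the following is only a minimal sketch of the textbook group Lasso regularizer, not the authors' implementation. It assumes each group is the flattened weights of one output channel, and the names `group_lasso_penalty`, `prune_groups`, `lam`, and `threshold` are hypothetical choices made for illustration.

```python
import math

def group_norms(weights):
    """L2 norm of each group (assumed here: one group per output channel)."""
    return [math.sqrt(sum(w * w for w in group)) for group in weights]

def group_lasso_penalty(weights, lam):
    """Textbook group Lasso term: lam * sum over groups of ||w_g||_2."""
    return lam * sum(group_norms(weights))

def prune_groups(weights, threshold):
    """Zero every group whose L2 norm is below a (hypothetical) threshold."""
    return [
        group if norm >= threshold else [0.0] * len(group)
        for group, norm in zip(weights, group_norms(weights))
    ]

# Toy example: three groups of two weights each.
weights = [[3.0, 4.0], [0.01, 0.02], [0.0, 1.0]]
penalty = group_lasso_penalty(weights, lam=0.1)
pruned = prune_groups(weights, threshold=0.1)

# Fraction of groups that are entirely zero after pruning.
sparsity = sum(all(w == 0.0 for w in g) for g in pruned) / len(pruned)
```

Because the penalty acts on whole-group norms, it tends to drive entire groups toward zero rather than scattered individual weights, which is the kind of structured sparsity that hardware can skip in bulk. How the chip selects groups and thresholds at run time is not stated in the abstract.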



