Exploiting Dynamic Bit Sparsity in Activation for Deep Neural Network Acceleration

Yongshuai Sun, Mengyuan Guo, Dacheng Liang, Shan Tang, Naifeng Jing
{"title":"利用激活中的动态比特稀疏性实现深度神经网络加速","authors":"Yongshuai Sun, Mengyuan Guo, Dacheng Liang, Shan Tang, Naifeng Jing","doi":"10.1109/ASICON52560.2021.9620448","DOIUrl":null,"url":null,"abstract":"Data sparsity is important in accelerating deep neural networks (DNNs). However, besides the zeroed values, the bit sparsity especially in activations are oftentimes missing in conventional DNN accelerators. In this paper, we present a DNN accelerator to exploit the bit sparsity by dynamically skipping zeroed bits in activations. To this goal, we first substitute the multiply-and-accumulate (MAC) units with more serial shift-and-accumulate units to sustain the computing parallelism. To prevent the low efficiency caused by the random number and positions of the zeroed bits in different activations, we propose activation-grouping, so that the activations in the same group can be computed on non-zero bits in different channels freely, and synchronization is only needed between different groups. We implement the proposed accelerator with 16 process units (PU) and 16 processing elements (PE) in each PU on FPGA built upon VTA (Versatile Tensor Accelerator) which can integrate seamlessly with TVM compilation. We evaluate the efficiency of our design with convolutional layers in resnet18 respectively, which achieves over 3.2x speedup on average compared with VTA design. In terms of the whole network, it can achieve over 2.26x speedup and over 2.0x improvement on area efficiency.","PeriodicalId":233584,"journal":{"name":"2021 IEEE 14th International Conference on ASIC (ASICON)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploiting Dynamic Bit Sparsity in Activation for Deep Neural Network Acceleration\",\"authors\":\"Yongshuai Sun, Mengyuan Guo, Dacheng Liang, Shan Tang, Naifeng Jing\",\"doi\":\"10.1109/ASICON52560.2021.9620448\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data sparsity is important in accelerating deep neural networks (DNNs). However, besides the zeroed values, the bit sparsity especially in activations are oftentimes missing in conventional DNN accelerators. In this paper, we present a DNN accelerator to exploit the bit sparsity by dynamically skipping zeroed bits in activations. To this goal, we first substitute the multiply-and-accumulate (MAC) units with more serial shift-and-accumulate units to sustain the computing parallelism. To prevent the low efficiency caused by the random number and positions of the zeroed bits in different activations, we propose activation-grouping, so that the activations in the same group can be computed on non-zero bits in different channels freely, and synchronization is only needed between different groups. We implement the proposed accelerator with 16 process units (PU) and 16 processing elements (PE) in each PU on FPGA built upon VTA (Versatile Tensor Accelerator) which can integrate seamlessly with TVM compilation. We evaluate the efficiency of our design with convolutional layers in resnet18 respectively, which achieves over 3.2x speedup on average compared with VTA design. 
In terms of the whole network, it can achieve over 2.26x speedup and over 2.0x improvement on area efficiency.\",\"PeriodicalId\":233584,\"journal\":{\"name\":\"2021 IEEE 14th International Conference on ASIC (ASICON)\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE 14th International Conference on ASIC (ASICON)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASICON52560.2021.9620448\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 14th International Conference on ASIC (ASICON)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASICON52560.2021.9620448","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Data sparsity is important in accelerating deep neural networks (DNNs). However, beyond zeroed values, bit sparsity, especially in activations, is often left unexploited by conventional DNN accelerators. In this paper, we present a DNN accelerator that exploits bit sparsity by dynamically skipping zeroed bits in activations. To this end, we first replace the multiply-and-accumulate (MAC) units with more serial shift-and-accumulate units to sustain computing parallelism. To avoid the inefficiency caused by the random number and positions of zeroed bits across activations, we propose activation grouping: activations in the same group are computed on their non-zero bits in different channels independently, and synchronization is needed only between groups. We implement the proposed accelerator, with 16 processing units (PUs) and 16 processing elements (PEs) per PU, on an FPGA built upon VTA (Versatile Tensor Accelerator), which integrates seamlessly with TVM compilation. We evaluate the design on the convolutional layers of ResNet-18, achieving over 3.2x speedup on average compared with the VTA design. For the whole network, it achieves over 2.26x speedup and over 2.0x improvement in area efficiency.
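The abstract describes two ideas: bit-serial shift-and-accumulate that skips zeroed activation bits, and activation grouping, where lanes within a group advance independently and synchronize only at group boundaries. The following minimal Python sketch illustrates the arithmetic behind both ideas under stated assumptions; it is not the authors' hardware design, and the function and variable names (`bit_serial_mac`, `grouped_cycles`, `act_bits`) are illustrative, not from the paper.

```python
# Sketch of bit-serial shift-and-accumulate with zero-bit skipping,
# plus a cycle-count model of activation grouping. Assumes unsigned
# activations of act_bits width; names are hypothetical.

def bit_serial_mac(activations, weights, act_bits=8):
    """Dot product computed by shift-and-accumulate over non-zero
    activation bits only; numerically equal to sum(a * w)."""
    acc = 0
    for a, w in zip(activations, weights):
        for b in range(act_bits):
            if (a >> b) & 1:          # skip zeroed bits entirely
                acc += w << b         # shift-and-accumulate
    return acc

def grouped_cycles(activation_group, act_bits=8):
    """Cycles for one group under activation grouping: each lane
    (channel) needs as many cycles as it has non-zero bits, and the
    group waits only for its slowest lane, since synchronization
    happens between groups rather than between individual lanes."""
    return max(bin(a & ((1 << act_bits) - 1)).count("1")
               for a in activation_group)

if __name__ == "__main__":
    acts = [3, 0, 129, 16]            # 8-bit activations
    wts = [5, -2, 7, 1]
    assert bit_serial_mac(acts, wts) == sum(a * w for a, w in zip(acts, wts))
    # Dense bit-serial execution would take 8 cycles per activation;
    # with zero-bit skipping this group finishes in max-popcount cycles.
    print("group cycles:", grouped_cycles(acts))   # -> 2
```

In this toy model, the speedup comes from the gap between the activation bit width (8 cycles per activation in a dense bit-serial design) and the per-group maximum popcount, which is the same intuition the paper's activation grouping relies on.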