DSSA: Dual-Side Sparse Systolic Array Architecture for Accelerating Convolutional Neural Network Training

Zheng Chen, Qi Yu, Fang Zheng, F. Guo, Zuoning Chen
{"title":"DSSA: Dual-Side Sparse Systolic Array Architecture for Accelerating Convolutional Neural Network Training","authors":"Zheng Chen, Qi Yu, Fang Zheng, F. Guo, Zuoning Chen","doi":"10.1145/3545008.3545086","DOIUrl":null,"url":null,"abstract":"Ever-growing CNN size incurs a significant amount of redundancy in model parameters, which in turn, puts considerable burden on hardware. Unstructured pruning is widely used to reduce model sparsity. While, the irregularity introduced by unstructured pruning makes it difficult to accelerate sparse CNNs on systolic array. To address this issue, a variety of accelerators have been proposed. SIGMA, the state-of-the-art sparse GEMM accelerator, achieves significant speedup over systolic array. However, SIGMA suffers from two disadvantages: 1) it only supports one-side sparsity, leaving potential for further performance gains; 2) SIGMA improves utilization of large-sized systolic arrays at the cost of extra overhead. In this paper, we propose DSSA, a dual-side sparse systolic array, to accelerate CNN training. DSSA bases its designs on a small-sized systolic array, which naturally achieves higher cell utilization without additional overhead. To facilitate dual-side sparsity processing, DSSA utilizes a cross-cycle reduction module to accumulate partial sum that belongs to the same column but being processed in different cycles. A comprehensive design space exploration is performed to seek the local optimal configurations for DSSA. We implement the logic design of DSSA using Verilog in RTL and evaluate its performance using a C++-based cycle-accurate performance simulator we built. Experimental results show that DSSA delivers, on average, a speedup of 2.13x and 13.81x over SIGMA and a basic systolic array with the same number of cells. Compared to SIGMA, DSSA incurs 16.59% area overhead and 25.49% power overhead when sparse filter is excluded, as SIGMA did.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545086","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Ever-growing CNN model sizes incur a significant amount of redundancy in model parameters, which in turn places a considerable burden on hardware. Unstructured pruning is widely used to remove this redundancy, but the irregularity it introduces makes it difficult to accelerate sparse CNNs on a systolic array. To address this issue, a variety of accelerators have been proposed. SIGMA, the state-of-the-art sparse GEMM accelerator, achieves significant speedup over a systolic array. However, SIGMA suffers from two disadvantages: 1) it supports only one-side sparsity, leaving potential for further performance gains; 2) it improves the utilization of large systolic arrays at the cost of extra overhead. In this paper, we propose DSSA, a dual-side sparse systolic array, to accelerate CNN training. DSSA bases its design on a small systolic array, which naturally achieves higher cell utilization without additional overhead. To facilitate dual-side sparsity processing, DSSA employs a cross-cycle reduction module to accumulate partial sums that belong to the same column but are processed in different cycles. A comprehensive design space exploration is performed to find locally optimal configurations for DSSA. We implement the logic design of DSSA in Verilog at the RTL level and evaluate its performance using a C++-based cycle-accurate simulator we built. Experimental results show that DSSA delivers, on average, speedups of 2.13x and 13.81x over SIGMA and a basic systolic array with the same number of cells, respectively. Compared to SIGMA, DSSA incurs 16.59% area overhead and 25.49% power overhead when the sparse filter is excluded, as in SIGMA's evaluation.
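To make the dual-side sparsity and cross-cycle reduction ideas concrete, the following is a minimal behavioral sketch in Python, not the DSSA RTL or the paper's actual dataflow: it streams nonzero partial products from two sparse operands a few "lanes" per cycle and folds partial sums for the same output position in a cross-cycle accumulator. The coordinate-list representation, the lane limit, and all function names are illustrative assumptions.

```python
# Behavioral sketch (assumed, not the DSSA hardware design): dual-side sparse
# matrix multiply where partial products for the same output entry may arrive
# in different cycles and are merged by a cross-cycle accumulator.
from collections import defaultdict

def sparse_nonzeros(mat):
    """Return (row, col, value) triples for the nonzero entries of a dense list-of-lists."""
    return [(i, j, v) for i, row in enumerate(mat)
            for j, v in enumerate(row) if v != 0]

def dual_side_sparse_matmul(A, B, lanes=4):
    """Multiply sparse A (M x K) by sparse B (K x N), processing at most
    `lanes` nonzero partial products per cycle, mimicking a small array."""
    nnz_a = sparse_nonzeros(A)
    nnz_b = sparse_nonzeros(B)
    # Pair only the nonzeros that share the reduction index k (both sides sparse).
    work = [(i, n, va * vb)
            for (i, ka, va) in nnz_a
            for (kb, n, vb) in nnz_b if ka == kb]
    acc = defaultdict(float)  # cross-cycle accumulator keyed by output (row, col)
    for start in range(0, len(work), lanes):
        for (i, n, p) in work[start:start + lanes]:
            acc[(i, n)] += p  # same output column, possibly accumulated across cycles
    M, N = len(A), len(B[0])
    return [[acc[(i, n)] for n in range(N)] for i in range(M)]

if __name__ == "__main__":
    A = [[1, 0, 2], [0, 0, 3]]
    B = [[0, 4], [5, 0], [6, 0]]
    print(dual_side_sparse_matmul(A, B))  # [[12.0, 4.0], [18.0, 0.0]]
```

The sketch only computes the products between matching nonzeros, which is the gain dual-side sparsity targets; a one-side-sparse design would still spend work on zeros in the second operand.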