SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). Pub Date: 2020-02-01. DOI: 10.1109/HPCA47549.2020.00015

Eric Qin, A. Samajdar, Hyoukjun Kwon, V. Nadella, S. Srinivasan, Dipankar Das, Bharat Kaul, T. Krishna


Citations: 256

Abstract

The advent of Deep Learning (DL) has radically transformed the computing industry across the entire spectrum from algorithms to circuits. As myriad application domains embrace DL, it has become synonymous with a genre of workloads across vision, speech, language, recommendations, robotics, and games. The key compute kernel within most DL workloads is general matrix-matrix multiplication (GEMM), which appears frequently during both the forward pass (inference and training) and the backward pass (training). GEMMs are a natural choice for hardware acceleration to speed up training, and have led to 2D systolic architectures like NVIDIA tensor cores and the Google Tensor Processing Unit (TPU). Unfortunately, emerging GEMMs in DL are highly irregular and sparse, which leads to poor data mappings on systolic architectures. This paper proposes SIGMA, a flexible and scalable architecture that offers high utilization of all its processing elements (PEs) regardless of kernel shape and sparsity. SIGMA includes a novel reduction tree microarchitecture named Forwarding Adder Network (FAN). SIGMA performs 5.7x better than systolic array architectures for irregular sparse matrices, and roughly 3x better than state-of-the-art sparse accelerators. We demonstrate an instance of SIGMA operating at 10.8 TFLOPS efficiency across arbitrary levels of sparsity, with a 65.10 mm^2 and 22.33 W footprint on a 28 nm process.
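The abstract's claim that irregular and sparse GEMMs map poorly onto rigid systolic arrays can be illustrated with a toy utilization estimate. The sketch below is not the paper's evaluation methodology. It assumes a simplified model in which one operand matrix is held stationary and tiled onto a fixed PE grid, every partially filled tile still occupies the whole grid, and zero-valued elements occupy PEs without doing useful work. The function name, the 128 x 128 array size, and the example shapes are hypothetical.

```python
import math


def stationary_utilization(rows, cols, pe_rows, pe_cols, density=1.0):
    """Toy estimate of the fraction of PE slots doing useful work.

    Assumption (not from the paper): a rows x cols stationary matrix is
    tiled onto a rigid pe_rows x pe_cols array, and each tile reserves the
    full array even when it is only partially filled. `density` is the
    fraction of nonzero elements in the stationary matrix.
    """
    tiles = math.ceil(rows / pe_rows) * math.ceil(cols / pe_cols)
    slots = tiles * pe_rows * pe_cols   # PE slots reserved across all tiles
    useful = rows * cols * density      # slots holding a nonzero element
    return useful / slots


# A square, dense operand fills a 128 x 128 array completely.
square_dense = stationary_utilization(128, 128, 128, 128)

# A short, wide (irregular) operand leaves most rows of the array idle.
irregular_dense = stationary_utilization(16, 1024, 128, 128)

# Adding 50% sparsity halves the useful work again under this model.
irregular_sparse = stationary_utilization(16, 1024, 128, 128, density=0.5)

print(square_dense, irregular_dense, irregular_sparse)
```

Under these assumptions, utilization drops as the operand shape departs from the array shape and as sparsity rises, which is the kind of loss a flexible interconnect is meant to recover.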


