Sparse-TPU: adapting systolic arrays for sparse matrices

Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392751

Xin He, S. Pal, Aporva Amarnath, Siying Feng, Dong-hyeon Park, A. Rovinski, Haojie Ye, Kuan-Yu Chen, R. Dreslinski, T. Mudge


Citations: 47

Abstract

While systolic arrays are widely used for dense-matrix operations, they are seldom used for sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and-Accumulate (MAC) units, similar to Google's Tensor Processing Unit (TPU), can be adapted to efficiently handle sparse matrices. TPU-like accelerators are built upon a 2D array of MAC units and have demonstrated high throughput and efficiency for dense matrix multiplication, which is a key kernel in machine learning algorithms and is the target of the TPU. In this work, we take a co-design approach: we first develop a packing technique to condense a sparse matrix, and then propose a systolic-array-based system, Sparse-TPU (STPU), to perform the matrix computations on the resulting denser packed matrices. To demonstrate the efficacy of our co-designed approach, we evaluate sparse matrix-vector multiplication on a broad set of synthetic and real-world sparse matrices. Experimental results show that STPU delivers 16.08X higher performance while consuming 4.39X and 19.79X lower energy for integer (int8) and floating point (float32) implementations, respectively, over a TPU baseline. Meanwhile, STPU has 12.93% area overhead and an average of 4.14% increase in dynamic energy over the TPU baseline for the float32 implementation.
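The abstract does not spell out the packing technique, so the following is only a minimal sketch of one plausible reading: greedily merging columns whose nonzeros fall in disjoint rows, and keeping each value's original column index so that sparse matrix-vector multiplication still picks the right vector element. The function names `pack_columns` and `packed_spmv`, the greedy first-fit policy, and the dictionary-based layout are all illustrative assumptions, not the authors' algorithm or hardware mapping.

```python
def pack_columns(A):
    """Greedy first-fit column packing (illustrative assumption, not the paper's method).

    A is a dense list-of-lists matrix. Columns whose nonzeros occupy
    disjoint rows are merged into one packed column. Each packed column
    is a dict: row index -> (original column index, value).
    """
    n_rows = len(A)
    n_cols = len(A[0]) if A else 0
    groups = []
    for j in range(n_cols):
        rows = [i for i in range(n_rows) if A[i][j] != 0]
        if not rows:
            continue  # an all-zero column contributes nothing
        for group in groups:
            # A column fits only if none of its nonzero rows is already taken.
            if all(i not in group for i in rows):
                target = group
                break
        else:
            target = {}
            groups.append(target)
        for i in rows:
            target[i] = (j, A[i][j])
    return groups


def packed_spmv(groups, x, n_rows):
    """Compute y = A @ x from the packed form.

    The stored original column index selects the matching element of x,
    so the result equals the unpacked product.
    """
    y = [0] * n_rows
    for group in groups:
        for i, (j, value) in group.items():
            y[i] += value * x[j]
    return y


if __name__ == "__main__":
    A = [
        [1, 0, 0, 2],
        [0, 3, 0, 0],
        [0, 0, 4, 0],
        [5, 0, 0, 0],
    ]
    groups = pack_columns(A)
    print(len(groups))                          # number of packed columns
    print(packed_spmv(groups, [1, 2, 3, 4], 4))  # matches the dense product
```

In this toy case four sparse columns condense into two packed columns, which is the kind of density gain that lets a fixed-size MAC array do more useful work per pass; the actual gains reported in the abstract depend on the authors' packing scheme and hardware support.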
