tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit

2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). Pub Date: 2023-06-20. DOI: 10.1109/ISVLSI59464.2023.10238524

P. Vellaisamy, Harideep Nair, Joseph Finn, Manav Trivedi, Albert Chen, Anna Li, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen



Abstract


tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit

General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary-encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents tubGEMM, a novel matrix-multiply unit design that employs hybrid temporal-unary and binary (tub) encoding and performs exact (not approximate) GEMM. It intrinsically exploits dynamic value sparsity to improve energy efficiency. Compared to the current best unary design, uGEMM, tubGEMM significantly reduces area, power, and energy by 89%, 87%, and 50%, respectively. A tubGEMM design performing 128x128 matrix multiply on 8-bit integers, in a commercial TSMC N5 (5 nm) process node, consumes just 0.22 $\mathrm{mm}^{2}$ die area, 417.72 mW power, and 8.86 $\mu$J energy, assuming no sparsity. Typical sparsity in DL workloads (MobileNetv2, ResNet50) reduces energy by more than 3x, and lowering precision to 4 and 2 bits further reduces it by 24x and 104x, respectively.
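The abstract does not describe the microarchitecture, so the following is only a minimal software sketch of the general idea behind mixing a temporal-unary operand with a binary operand. It assumes that each element of `A` is presented as a pulse train whose length equals its magnitude, that the matching row of `B` stays in ordinary binary form, and that the accumulator adds that row once per pulse. The function name `tub_gemm`, the sign handling, and the pulse counter are hypothetical illustration choices, not the authors' design.

```python
import numpy as np


def tub_gemm(A: np.ndarray, B: np.ndarray) -> tuple[np.ndarray, int]:
    """Exact integer GEMM modelled as pulse-driven accumulation.

    Assumption (not from the paper): A[i, k] is a temporal-unary pulse
    train of |A[i, k]| pulses, and row B[k, :] is added (or subtracted,
    for a negative A[i, k]) to the accumulator once per pulse.
    Returns the product and the total number of pulses processed.
    """
    A = np.asarray(A, dtype=np.int64)
    B = np.asarray(B, dtype=np.int64)
    if A.shape[1] != B.shape[0]:
        raise ValueError("inner dimensions must match")

    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    pulses = 0
    for i in range(A.shape[0]):
        for k in range(A.shape[1]):
            a = int(A[i, k])
            sign = 1 if a >= 0 else -1
            # A zero operand produces no pulses, hence no accumulation work.
            for _ in range(abs(a)):
                C[i, :] += sign * B[k, :]
                pulses += 1
    return C, pulses


rng = np.random.default_rng(0)
A = rng.integers(-8, 8, size=(4, 5))
B = rng.integers(-8, 8, size=(5, 3))

# Zero out roughly half of A to mimic value sparsity.
A_sparse = np.where(rng.random(A.shape) < 0.5, 0, A)

C, pulses = tub_gemm(A, B)
C_sparse, pulses_sparse = tub_gemm(A_sparse, B)
```

In this toy model the result matches `A @ B` exactly because nothing is sampled or rounded, and the pulse count equals the sum of the magnitudes in `A`, so zero or small values directly shrink the amount of accumulation work. How closely this maps to the energy figures reported above depends on hardware details that the abstract does not give.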
