{"title":"IOPS: A Unified SpMM Accelerator Based on Inner-Outer-Hybrid Product","authors":"Wenhao Sun;Wendi Sun;Song Chen;Yi Kang","doi":"10.1109/TC.2025.3558013","DOIUrl":null,"url":null,"abstract":"Sparse matrix multiplication (SpMM) is widely applied to numerous domains, such as graph processing and machine learning. However, inner product (IP) induces redundant zero-element computing for mismatched nonzero operands, while outer product (OP) lacks input reuse across Process Elements (PEs). Besides, current accelerators only focus on sparse-sparse matrix multiplication (SSMM) or sparse-dense matrix multiplication (SDMM), rarely performing efficiently for both. To compensate for the shortcomings of IP and OP, we propose an inner-outer-hybrid product (IOHP) method, which reuses the input matrix among PEs with IP and removes zero-element calculations with OP in each PE. Based on IOHP, we co-design a accelerator with a unified computing flow, called IOPS, to efficiently process both SSMM and SDMM. It divides the SpMM into three stages: encoding, partial sum (psum) calculation, and address mapping, where the input matrices can be reused among PEs after encoding (IP) and the zero element can be skipped in the latter two stages (OP). Furthermore, an adaptive partition strategy is proposed to tile the input matrices based on their sparsity ratios, effectively utilizing the on-chip storage and reducing DRAM access. Compared with SpArch, we achieve <inline-formula><tex-math>$1.2\\boldsymbol{\\times}$</tex-math></inline-formula>~<inline-formula><tex-math>$4.3\\boldsymbol{\\times}$</tex-math></inline-formula> performance and <inline-formula><tex-math>$1.3\\boldsymbol{\\times}$</tex-math></inline-formula>~<inline-formula><tex-math>$4.8\\boldsymbol{\\times}$</tex-math></inline-formula> energy efficiency, with <inline-formula><tex-math>$1.4\\boldsymbol{\\times}$</tex-math></inline-formula>~<inline-formula><tex-math>$2.1\\boldsymbol{\\times}$</tex-math></inline-formula> DRAM access saving.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2210-2222"},"PeriodicalIF":3.8000,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10949697/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
Sparse matrix multiplication (SpMM) is widely applied in numerous domains, such as graph processing and machine learning. However, the inner product (IP) dataflow induces redundant zero-element computation when nonzero operands are mismatched, while the outer product (OP) dataflow lacks input reuse across Processing Elements (PEs). Moreover, current accelerators focus on either sparse-sparse matrix multiplication (SSMM) or sparse-dense matrix multiplication (SDMM), and rarely perform efficiently on both. To compensate for the shortcomings of IP and OP, we propose an inner-outer-hybrid product (IOHP) method, which reuses the input matrix among PEs as in IP and removes zero-element calculations as in OP within each PE. Based on IOHP, we co-design an accelerator with a unified computing flow, called IOPS, to efficiently process both SSMM and SDMM. It divides SpMM into three stages: encoding, partial-sum (psum) calculation, and address mapping, where the input matrices can be reused among PEs after encoding (IP) and zero elements can be skipped in the latter two stages (OP). Furthermore, an adaptive partition strategy is proposed to tile the input matrices based on their sparsity ratios, effectively utilizing on-chip storage and reducing DRAM accesses. Compared with SpArch, we achieve $1.2\times$~$4.3\times$ the performance and $1.3\times$~$4.8\times$ the energy efficiency, with $1.4\times$~$2.1\times$ savings in DRAM accesses.
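The IP/OP trade-off the abstract describes is easiest to see in code. Below is a minimal Python sketch, not the authors' hardware dataflow: the dict-of-dicts sparse format, the function names, and the per-PE row partitioning are illustrative assumptions. It contrasts inner product (full inner-dimension scan, so mismatched nonzeros still cost work), outer product (nonzero-only multiplies, but per-PE inputs and psums to merge), and the hybrid idea of broadcasting one operand across PEs while zero-skipping inside each PE.

from collections import defaultdict

def inner_product_spmm(A, B, n_rows, n_cols, n_inner):
    """Inner product: C[i][j] = sum_k A[i][k] * B[k][j].
    Every (i, j) pair scans the whole inner dimension, so a probe is paid
    even when A[i][k] and B[k][j] nonzeros do not line up."""
    C = defaultdict(dict)
    for i in range(n_rows):
        for j in range(n_cols):
            acc = 0
            for k in range(n_inner):  # redundant work on mismatched operands
                acc += A.get(i, {}).get(k, 0) * B.get(k, {}).get(j, 0)
            if acc:
                C[i][j] = acc
    return C

def outer_product_spmm(A_cols, B_rows):
    """Outer product: C = sum_k (column k of A) x (row k of B).
    Only matched nonzeros are multiplied (zero-skipping), but the psums
    must be accumulated and mapped to their final addresses afterwards."""
    C = defaultdict(lambda: defaultdict(int))
    for k in A_cols.keys() & B_rows.keys():  # skip k empty on either side
        for i, a in A_cols[k].items():
            for j, b in B_rows[k].items():
                C[i][j] += a * b             # psum accumulation
    return C

def hybrid_spmm(A_row_blocks, B_rows):
    """Hybrid flavor: B_rows is broadcast to every PE (IP-style input
    reuse), while each PE runs an OP-style zero-skipping kernel on its
    own block of A rows; the per-PE psums would be merged afterwards."""
    psums = []
    for block in A_row_blocks:               # one block per PE
        A_cols = defaultdict(dict)           # transpose block to column-major
        for i, row in block.items():
            for k, a in row.items():
                A_cols[k][i] = a
        psums.append(outer_product_spmm(A_cols, B_rows))
    return psums

if __name__ == "__main__":
    # A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]  ->  C = [[0, 3], [8, 0]]
    A = {0: {0: 1}, 1: {1: 2}}
    B = {0: {1: 3}, 1: {0: 4}}
    print(dict(inner_product_spmm(A, B, 2, 2, 2)))  # {0: {1: 3}, 1: {0: 8}}
    print(hybrid_spmm([{0: A[0]}, {1: A[1]}], B))   # same values as per-PE psums

In hardware terms, broadcasting B_rows to every PE corresponds to the IP-style input reuse, the per-PE nonzero-only loop to the OP-style zero-skipping, and the returned per-PE psums to the psum-calculation and address-mapping stages named in the abstract.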
About the Journal:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.