Title: DMSA: An Efficient Architecture for Sparse–Sparse Matrix Multiplication Based on Distribute-Merge Product Dataflow
Authors: Yuta Nagahara; Jiale Yan; Kazushi Kawamura; Daichi Fujiki; Masato Motomura; Thiem Van Chu
Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 7, pp. 1858–1871
DOI: 10.1109/TVLSI.2025.3558895
Publication date: 2025-04-23
URL: https://ieeexplore.ieee.org/document/10974734/
Citations: 0
Abstract
Sparse–sparse matrix multiplication (SpMSpM) is a fundamental operation in many applications. Existing SpMSpM accelerators based on the inner product (IP) and outer product (OP) suffer from low computational efficiency and high memory traffic due to inefficient index matching and merging overheads. Accelerators based on Gustavson's product (GP) mitigate some of these challenges but struggle with workload imbalance and irregular memory access patterns, which limit computational parallelism. To overcome these limitations, we propose the distribute-merge product (DMP), a novel SpMSpM dataflow that evenly distributes workloads across multiple computation streams and merges partial results efficiently. We design and implement the DMP-based SpMSpM architecture (DMSA), incorporating four key techniques to fully exploit the parallelism of DMP and efficiently handle irregular memory accesses. Implemented on a Xilinx ZCU106 FPGA, DMSA achieves speedups of up to $3.38\times$ and $1.73\times$ over two state-of-the-art FPGA-based SpMSpM accelerators while maintaining comparable hardware resource usage. In addition, compared with CPU and GPU implementations on an NVIDIA Jetson AGX Xavier, DMSA is $4.96\times$ and $1.53\times$ faster while achieving $6.67\times$ and $2.33\times$ better energy efficiency, respectively.
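For background, the abstract contrasts DMP with Gustavson's product (GP), the row-wise SpMSpM dataflow. A minimal software sketch of GP on CSR-format matrices (the `spgemm_gustavson` function and its CSR argument layout are illustrative assumptions, not the paper's DMSA design) shows why GP is imbalance-prone: each output row's work is proportional to the nonzeros it touches, so rows with many nonzeros dominate a stream's runtime.

```python
# Reference sketch of Gustavson's product (GP) SpMSpM, the row-wise
# dataflow the abstract contrasts with inner/outer product. CSR matrices
# are passed as (indptr, indices, data) triples. This is a plain-Python
# illustration of the classical algorithm, not the DMSA hardware dataflow.

def spgemm_gustavson(a_indptr, a_indices, a_data,
                     b_indptr, b_indices, b_data, n_rows):
    """Compute C = A @ B row by row: each nonzero a[i,k] scales row k of B,
    and the scaled rows are accumulated into row i of C."""
    c_indptr, c_indices, c_data = [0], [], []
    for i in range(n_rows):
        acc = {}  # sparse accumulator for row i of C
        for p in range(a_indptr[i], a_indptr[i + 1]):
            k, a_ik = a_indices[p], a_data[p]
            # Merge the scaled row k of B into the accumulator.
            for q in range(b_indptr[k], b_indptr[k + 1]):
                j = b_indices[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_data[q]
        for j in sorted(acc):  # emit row i in column order
            c_indices.append(j)
            c_data.append(acc[j])
        c_indptr.append(len(c_indices))
    return c_indptr, c_indices, c_data


# Example: A = [[1,0],[2,3]], B = [[0,4],[5,0]]  =>  C = [[0,4],[15,8]]
c = spgemm_gustavson([0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0],
                     [0, 1, 2], [1, 0], [4.0, 5.0], 2)
print(c)  # -> ([0, 1, 3], [1, 0, 1], [4.0, 15.0, 8.0])
```

The inner loops make the imbalance concrete: row 0 of A triggers one row-merge while row 1 triggers two, so parallelizing GP by assigning whole rows to streams leaves some streams idle, which is the motivation the abstract gives for DMP's even workload distribution.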
Journal Introduction:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
The design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of these Transactions focuses on VLSI/ULSI microelectronic systems integration.