SDMA: An Efficient and Flexible Sparse-Dense Matrix-Multiplication Architecture for GNNs

Yingxue Gao, Lei Gong, Chao Wang, Teng Wang, Xuehai Zhou
{"title":"SDMA: An Efficient and Flexible Sparse-Dense Matrix-Multiplication Architecture for GNNs","authors":"Yingxue Gao, Lei Gong, Chao Wang, Teng Wang, Xuehai Zhou","doi":"10.1109/FPL57034.2022.00054","DOIUrl":null,"url":null,"abstract":"In recent years, graph neural networks (GNNs) as a deep learning model have emerged. Sparse-Dense Matrix Multiplication (SpMM) is the critical component of GNNs. However, SpMM involves many irregular calculations and random memory accesses, resulting in the inefficiency of general-purpose processors and dedicated accelerators. The highly sparse and uneven distribution of the graph further exacerbates the above problems. In this work, we propose SDMA, an efficient architecture to accelerate SpMM for GNNs. SDMA can collaboratively address the challenges of load imbalance and irregular memory accesses. We first present three hardware-oriented optimization methods: 1) The Equal-value partition method effectively divides the sparse matrix to achieve load balancing between tiles. 2) The vertex-clustering optimization method can explore more data locality. 3) An adaptive on-chip dataflow scheduling method is proposed to make full use of computing resources. Then, we combine and integrate the above optimization into SDMA to achieve a high-performance architecture. Finally, we prototype SDMA on the Xilinx Alveo U50 FPGA. The results demonstrate that SDMA achieves 2.19x-3.35x energy efficiency over the GPU implementation and 2.03x DSP efficiency over the FPGA implementation.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"118 6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FPL57034.2022.00054","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

In recent years, graph neural networks (GNNs) have emerged as an important class of deep learning models. Sparse-Dense Matrix Multiplication (SpMM) is a critical component of GNNs. However, SpMM involves many irregular computations and random memory accesses, which make both general-purpose processors and dedicated accelerators inefficient. The high sparsity and uneven non-zero distribution of graph data further exacerbate these problems. In this work, we propose SDMA, an efficient architecture that accelerates SpMM for GNNs by jointly addressing load imbalance and irregular memory access. We first present three hardware-oriented optimization methods: 1) an equal-value partition method that divides the sparse matrix so that tiles carry balanced loads; 2) a vertex-clustering optimization method that exposes more data locality; and 3) an adaptive on-chip dataflow scheduling method that makes full use of the computing resources. We then integrate these optimizations into SDMA to obtain a high-performance architecture. Finally, we prototype SDMA on a Xilinx Alveo U50 FPGA. The results demonstrate that SDMA achieves 2.19x-3.35x higher energy efficiency than a GPU implementation and 2.03x higher DSP efficiency than a prior FPGA implementation.
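
To make two of the abstract's ideas concrete, below is a minimal Python sketch, not taken from the paper: a CSR-based SpMM kernel whose inner gather illustrates the irregular memory accesses described above, and an equal-nnz row partition approximating the spirit of the equal-value partition method. All names (`spmm_csr`, `equal_nnz_partition`) and details are illustrative assumptions, not SDMA's implementation.

```python
# Illustrative sketch only; structure and names are assumptions,
# not SDMA's hardware design.
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Multiply a CSR sparse matrix A by a dense matrix B: C = A @ B.

    The inner loop gathers rows of B at the column indices of A's
    non-zeros -- the random accesses the abstract calls irregular.
    """
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            C[i] += data[k] * B[indices[k]]  # gather row indices[k] of B
    return C

def equal_nnz_partition(indptr, n_tiles):
    """Split rows into n_tiles holding roughly equal non-zero counts.

    Cuts are placed where the cumulative non-zero count (indptr)
    crosses each multiple of total_nnz / n_tiles.
    """
    total_nnz = indptr[-1]
    target = total_nnz / n_tiles
    bounds, next_cut = [0], target
    for i in range(1, len(indptr) - 1):
        if indptr[i] >= next_cut and len(bounds) < n_tiles:
            bounds.append(i)
            next_cut += target
    bounds.append(len(indptr) - 1)
    return bounds  # tile t covers rows bounds[t] .. bounds[t+1]-1

if __name__ == "__main__":
    # Tiny example: a 4x4 CSR matrix times a dense 4x2 matrix.
    indptr = np.array([0, 2, 3, 3, 5])
    indices = np.array([0, 2, 1, 0, 3])
    data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    B = np.random.rand(4, 2)
    print(spmm_csr(indptr, indices, data, B))
    print(equal_nnz_partition(indptr, n_tiles=2))
```

On a power-law graph, splitting rows by row count can leave most non-zeros in a few tiles; splitting by cumulative non-zero count, as sketched above, keeps the per-tile work roughly even, which is the load-balancing goal the abstract states.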