Yingxue Gao, Lei Gong, Chao Wang, Teng Wang, Xuehai Zhou
{"title":"SDMA:一种高效灵活的稀疏密集矩阵乘法gnn架构","authors":"Yingxue Gao, Lei Gong, Chao Wang, Teng Wang, Xuehai Zhou","doi":"10.1109/FPL57034.2022.00054","DOIUrl":null,"url":null,"abstract":"In recent years, graph neural networks (GNNs) as a deep learning model have emerged. Sparse-Dense Matrix Multiplication (SpMM) is the critical component of GNNs. However, SpMM involves many irregular calculations and random memory accesses, resulting in the inefficiency of general-purpose processors and dedicated accelerators. The highly sparse and uneven distribution of the graph further exacerbates the above problems. In this work, we propose SDMA, an efficient architecture to accelerate SpMM for GNNs. SDMA can collaboratively address the challenges of load imbalance and irregular memory accesses. We first present three hardware-oriented optimization methods: 1) The Equal-value partition method effectively divides the sparse matrix to achieve load balancing between tiles. 2) The vertex-clustering optimization method can explore more data locality. 3) An adaptive on-chip dataflow scheduling method is proposed to make full use of computing resources. Then, we combine and integrate the above optimization into SDMA to achieve a high-performance architecture. Finally, we prototype SDMA on the Xilinx Alveo U50 FPGA. The results demonstrate that SDMA achieves 2.19x-3.35x energy efficiency over the GPU implementation and 2.03x DSP efficiency over the FPGA implementation.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"118 6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"SDMA: An Efficient and Flexible Sparse-Dense Matrix-Multiplication Architecture for GNNs\",\"authors\":\"Yingxue Gao, Lei Gong, Chao Wang, Teng Wang, Xuehai Zhou\",\"doi\":\"10.1109/FPL57034.2022.00054\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, graph neural networks (GNNs) as a deep learning model have emerged. Sparse-Dense Matrix Multiplication (SpMM) is the critical component of GNNs. However, SpMM involves many irregular calculations and random memory accesses, resulting in the inefficiency of general-purpose processors and dedicated accelerators. The highly sparse and uneven distribution of the graph further exacerbates the above problems. In this work, we propose SDMA, an efficient architecture to accelerate SpMM for GNNs. SDMA can collaboratively address the challenges of load imbalance and irregular memory accesses. We first present three hardware-oriented optimization methods: 1) The Equal-value partition method effectively divides the sparse matrix to achieve load balancing between tiles. 2) The vertex-clustering optimization method can explore more data locality. 3) An adaptive on-chip dataflow scheduling method is proposed to make full use of computing resources. Then, we combine and integrate the above optimization into SDMA to achieve a high-performance architecture. Finally, we prototype SDMA on the Xilinx Alveo U50 FPGA. 
The results demonstrate that SDMA achieves 2.19x-3.35x energy efficiency over the GPU implementation and 2.03x DSP efficiency over the FPGA implementation.\",\"PeriodicalId\":380116,\"journal\":{\"name\":\"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)\",\"volume\":\"118 6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FPL57034.2022.00054\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FPL57034.2022.00054","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
SDMA: An Efficient and Flexible Sparse-Dense Matrix-Multiplication Architecture for GNNs
In recent years, graph neural networks (GNNs) have emerged as a powerful class of deep learning models. Sparse-Dense Matrix Multiplication (SpMM) is a critical component of GNNs. However, SpMM involves many irregular computations and random memory accesses, which make both general-purpose processors and dedicated accelerators inefficient. The high sparsity and uneven distribution of graph data further exacerbate these problems. In this work, we propose SDMA, an efficient architecture that accelerates SpMM for GNNs and jointly addresses the challenges of load imbalance and irregular memory access. We first present three hardware-oriented optimization methods: 1) an equal-value partition method that effectively divides the sparse matrix to balance the load across tiles; 2) a vertex-clustering optimization that exploits more data locality; 3) an adaptive on-chip dataflow scheduling method that makes full use of the computing resources. We then integrate these optimizations into SDMA to obtain a high-performance architecture. Finally, we prototype SDMA on the Xilinx Alveo U50 FPGA. The results demonstrate that SDMA achieves 2.19x-3.35x higher energy efficiency than the GPU implementation and 2.03x higher DSP efficiency than the FPGA implementation.
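To make the load-balancing idea in the abstract concrete, below is a minimal software sketch of nonzero-balanced row partitioning of a CSR sparse matrix followed by tile-wise SpMM. This is only an illustration of the general technique the abstract names (equal-value partitioning for load balance between tiles); the function names, the tiling granularity, and the use of NumPy/SciPy are assumptions for the sketch, not the authors' hardware implementation.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def equal_nnz_row_partition(indptr, num_tiles):
    """Split CSR rows into tiles holding roughly equal numbers of nonzeros.

    Illustrative stand-in for an equal-value partition: each tile gets a
    similar amount of SpMM work, so no tile dominates the runtime.
    """
    total_nnz = indptr[-1]
    target = total_nnz / num_tiles
    boundaries = [0]
    for t in range(1, num_tiles):
        # First row index whose cumulative nonzero count reaches t * target.
        boundaries.append(int(np.searchsorted(indptr, t * target)))
    boundaries.append(len(indptr) - 1)
    return list(zip(boundaries[:-1], boundaries[1:]))

def spmm_tile(data, indices, indptr, dense, row_range):
    """Multiply one tile of rows of a CSR sparse matrix by a dense matrix."""
    lo, hi = row_range
    out = np.zeros((hi - lo, dense.shape[1]))
    for r in range(lo, hi):
        for k in range(indptr[r], indptr[r + 1]):
            out[r - lo] += data[k] * dense[indices[k]]
    return out

# Toy example: a sparse adjacency-like matrix times a dense feature matrix.
A = sparse_random(8, 8, density=0.3, format="csr", random_state=0)
X = np.random.rand(8, 4)
tiles = equal_nnz_row_partition(A.indptr, num_tiles=4)
Y = np.vstack([spmm_tile(A.data, A.indices, A.indptr, X, t) for t in tiles])
assert np.allclose(Y, A @ X)
```

In hardware, each such tile would map to a processing element; balancing nonzeros rather than rows is what keeps the tiles' workloads comparable when the graph's degree distribution is highly skewed.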