Scaling Sparse Matrix Multiplication on CPU-GPU Nodes

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2021-05-01. DOI: 10.1109/IPDPS49936.2021.00047

Yang Xia, Peng Jiang, G. Agrawal, R. Ramnath


Citations: 4

Abstract

Multiplication of two sparse matrices (SpGEMM) is a popular kernel behind many numerical solvers, and it also features in the implementation of many common graph algorithms. Though many recent research efforts have focused on implementing SpGEMM efficiently on a single GPU, none of the existing work has considered the case where the memory requirements exceed the size of GPU memory. Similarly, the use of the aggregate computing power of CPU and GPU has not been addressed for such large matrices. In this paper, we present a framework for scaling SpGEMM computations to matrices that do not fit into GPU memory. We address how the computation and data can be partitioned across kernel executions on GPUs. An important emphasis in our work is overlapping data movement and computation. We achieve this by addressing many challenges, such as avoiding dynamic memory allocations and re-scheduling data transfers with the computation of chunks. We extend our framework to make efficient use of both GPU and CPU by developing an efficient work distribution strategy. Our evaluation on 9 large matrices shows that our out-of-core GPU implementation achieves speedups of 1.98x to 3.03x over a state-of-the-art multi-core CPU implementation, that our hybrid implementation further achieves speedups of up to 3.74x, and that our design choices contribute directly to this performance.
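The abstract does not give implementation details, so the following is only a minimal CPU-side sketch of the general idea of chunked SpGEMM, not the authors' framework. It assumes CSR inputs stored as plain Python lists, and the names `row_flops`, `make_chunks`, `multiply_chunk`, `split_chunks`, and `spgemm_chunked`, as well as the `budget` and `gpu_speedup` parameters, are hypothetical. The per-row product count serves as an upper bound on output size, which is one common way to size buffers ahead of time instead of allocating dynamically; overlapping transfers with computation is not modeled here.

```python
from typing import List, Tuple

CSR = Tuple[List[int], List[int], List[float]]  # (indptr, indices, data)
Chunk = Tuple[int, int]                         # half-open row range [lo, hi)


def row_flops(a: CSR, b: CSR) -> List[int]:
    """Number of scalar products per row of A*B (also bounds that row's nnz)."""
    a_ptr, a_idx, _ = a
    b_ptr = b[0]
    return [
        sum(b_ptr[a_idx[k] + 1] - b_ptr[a_idx[k]]
            for k in range(a_ptr[i], a_ptr[i + 1]))
        for i in range(len(a_ptr) - 1)
    ]


def make_chunks(flops: List[int], budget: int) -> List[Chunk]:
    """Group consecutive rows so each chunk's bound stays within `budget`.
    A single row larger than `budget` still becomes its own chunk."""
    chunks, start, used = [], 0, 0
    for i, f in enumerate(flops):
        if used + f > budget and i > start:
            chunks.append((start, i))
            start, used = i, 0
        used += f
    chunks.append((start, len(flops)))
    return chunks


def multiply_chunk(a: CSR, b: CSR, rows: Chunk) -> List[List[Tuple[int, float]]]:
    """Row-wise (Gustavson-style) SpGEMM restricted to one chunk of A's rows."""
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    out = []
    for i in range(rows[0], rows[1]):
        acc = {}  # column -> accumulated value for output row i
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j = a_idx[k]
            for t in range(b_ptr[j], b_ptr[j + 1]):
                acc[b_idx[t]] = acc.get(b_idx[t], 0.0) + a_val[k] * b_val[t]
        out.append(sorted(acc.items()))
    return out


def split_chunks(chunks: List[Chunk], flops: List[int],
                 gpu_speedup: float) -> Tuple[List[Chunk], List[Chunk]]:
    """Greedy hybrid split: each chunk goes to the device expected to finish
    it earlier, assuming the GPU is `gpu_speedup` times faster than the CPU."""
    t_gpu = t_cpu = 0.0
    gpu, cpu = [], []
    for lo, hi in chunks:
        work = sum(flops[lo:hi])
        if t_gpu + work / gpu_speedup <= t_cpu + work:
            gpu.append((lo, hi))
            t_gpu += work / gpu_speedup
        else:
            cpu.append((lo, hi))
            t_cpu += work
    return gpu, cpu


def spgemm_chunked(a: CSR, b: CSR, budget: int) -> CSR:
    """Compute A*B one chunk at a time and stitch the results into CSR."""
    indptr, indices, data = [0], [], []
    for rows in make_chunks(row_flops(a, b), budget):
        for row in multiply_chunk(a, b, rows):
            for col, val in row:
                indices.append(col)
                data.append(val)
            indptr.append(len(indices))
    return indptr, indices, data
```

In a real out-of-core setting each call to `multiply_chunk` would correspond to one kernel execution on data staged into device memory, and `split_chunks` stands in for whatever work distribution strategy decides which chunks the CPU handles.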

