Analyzing Multi-trillion Edge Graphs on Large GPU Clusters: A Case Study with PageRank

Seunghwa Kang, Joseph Nke, Brad Rees
{"title":"Analyzing Multi-trillion Edge Graphs on Large GPU Clusters: A Case Study with PageRank","authors":"Seunghwa Kang, Joseph Nke, Brad Rees","doi":"10.1109/HPEC55821.2022.9926341","DOIUrl":null,"url":null,"abstract":"We previously reported PageRank performance results on a cluster with 32 A100 GPUs [7]. This paper extends the previous work to 2048 GPUs. The previous implementation performs well as long as the number of G PU s is small relative to the square of the average vertex degree but its scalability deteriorates as the number of GPUs further increases. We updated our previous implementation with the following objectives: 1) enable analyzing a P times larger graph with P times more GPUs up to P = 2048, 2) achieve reasonably good weak scaling, and 3) integrate the improvements to the open-source data science ecosystem (i.e. RAPIDS cuGraph, https://github.com/rapidsai/cugraph). While we evaluate the updates with PageRank in this paper, they improve the scalability of a broader set of algorithms in cuGraph. To be more specific, we updated our 2D edge partitioning scheme; implemented the PDCSC (partially doubly compressed sparse column) format which is a hybrid data structure that combines CSC (compressed sparse column) and DCSC (doubly compressed sparse column); adopted (key, value) pairs to store edge source vertex property values; and improved the reduction communication strategy. The 32 GPU cluster has A100 GPUs (40 GB HBM per GPU) connected with NVLink. We ran the updated implementation on the Selene supercomputer which uses InfiniBand for inter-node communication and NVLink for intra-node communication. Each Selene node has eight A100 GPUs (80 GB HBM per GPU). Analyzing the web crawl graph (3.563 billion vertices and 128.7 billion edges, 32 bit vertex ID, unweighted, average vertex degree: 36.12) took 0.187 second per Page Rank iteration on the 32 GPU cluster. Computing Page Rank scores of a scale 38 R-mat graph (274.9 billion vertices and 4.398 trillion edges, 64 bit vertex ID, 32 bit edge weight, average vertex degree: 16) took 1.54 second per Page Rank iteration on the Selene supercomputer with 2048 GPUs. We conclude this paper discussing potential network system enhancements to improve the scaling.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC55821.2022.9926341","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

We previously reported PageRank performance results on a cluster with 32 A100 GPUs [7]. This paper extends the previous work to 2048 GPUs. The previous implementation performs well as long as the number of GPUs is small relative to the square of the average vertex degree, but its scalability deteriorates as the number of GPUs further increases. We updated our previous implementation with the following objectives: 1) enable analyzing a P times larger graph with P times more GPUs, up to P = 2048; 2) achieve reasonably good weak scaling; and 3) integrate the improvements into the open-source data science ecosystem (i.e., RAPIDS cuGraph, https://github.com/rapidsai/cugraph). While we evaluate the updates with PageRank in this paper, they improve the scalability of a broader set of algorithms in cuGraph. To be more specific, we updated our 2D edge partitioning scheme; implemented the PDCSC (partially doubly compressed sparse column) format, a hybrid data structure that combines CSC (compressed sparse column) and DCSC (doubly compressed sparse column); adopted (key, value) pairs to store edge source vertex property values; and improved the reduction communication strategy. The 32 GPU cluster has A100 GPUs (40 GB HBM per GPU) connected with NVLink. We ran the updated implementation on the Selene supercomputer, which uses InfiniBand for inter-node communication and NVLink for intra-node communication. Each Selene node has eight A100 GPUs (80 GB HBM per GPU). Analyzing the web crawl graph (3.563 billion vertices and 128.7 billion edges, 32-bit vertex IDs, unweighted, average vertex degree: 36.12) took 0.187 seconds per PageRank iteration on the 32 GPU cluster. Computing PageRank scores of a scale 38 R-MAT graph (274.9 billion vertices and 4.398 trillion edges, 64-bit vertex IDs, 32-bit edge weights, average vertex degree: 16) took 1.54 seconds per PageRank iteration on the Selene supercomputer with 2048 GPUs. We conclude this paper by discussing potential network system enhancements to improve the scaling.
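The abstract describes PDCSC only as a hybrid of CSC and DCSC, so the sketch below is an assumption about what "partially doubly compressed" could mean: a CSC-style block with one offset slot per column for a designated set of columns, plus a DCSC-style block (explicit column IDs and offsets only for columns that actually have edges) for the rest. It is a minimal NumPy illustration of that layout; `build_pdcsc`, `dense_cols`, and all other names are hypothetical and are not part of cuGraph's actual implementation or API.

```python
import numpy as np

def build_pdcsc(col_indices, row_indices, num_cols, dense_cols):
    """Toy PDCSC-like structure built from COO edges (row, col).

    Columns listed in `dense_cols` are stored CSC-style (one offset per
    column); all remaining columns are stored DCSC-style, keeping offsets
    only for columns that have at least one edge, together with an explicit
    array of those column IDs.
    """
    col_indices = np.asarray(col_indices)
    row_indices = np.asarray(row_indices)
    dense_cols = np.sort(np.asarray(dense_cols))

    dense_mask = np.zeros(num_cols, dtype=bool)
    dense_mask[dense_cols] = True

    # Sort edges by destination column so both blocks are column-grouped.
    order = np.argsort(col_indices, kind="stable")
    cols_sorted = col_indices[order]
    rows_sorted = row_indices[order]
    in_dense = dense_mask[cols_sorted]

    # CSC block: an offset slot for every dense column, even empty ones.
    counts = np.bincount(cols_sorted[in_dense], minlength=num_cols)[dense_cols]
    csc_offsets = np.concatenate(([0], np.cumsum(counts)))
    csc_rows = rows_sorted[in_dense]

    # DCSC block: offsets only for sparse columns that actually have edges.
    sparse_cols = cols_sorted[~in_dense]
    sparse_rows = rows_sorted[~in_dense]
    dcsc_col_ids, first = np.unique(sparse_cols, return_index=True)
    dcsc_offsets = np.append(first, len(sparse_cols))

    return {
        "csc": {"cols": dense_cols, "offsets": csc_offsets, "rows": csc_rows},
        "dcsc": {"cols": dcsc_col_ids, "offsets": dcsc_offsets, "rows": sparse_rows},
    }

if __name__ == "__main__":
    # 6 columns; columns 0 and 3 are treated as the "dense" CSC part.
    cols = np.array([0, 0, 3, 5, 2, 3])
    rows = np.array([1, 4, 2, 0, 3, 5])
    print(build_pdcsc(cols, rows, num_cols=6, dense_cols=[0, 3]))
```

The appeal of such a hybrid is that per-column offsets are cheap where most columns in a partition have edges, while the doubly compressed block avoids storing offsets for the long tail of empty columns that appears when a very large vertex range is 2D-partitioned across many GPUs; how cuGraph actually chooses the split is not stated in the abstract.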