DSP: Efficient GNN Training with Multiple GPUs

Zhenkun Cai, Qihui Zhou, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, G. Karypis
{"title":"DSP: Efficient GNN Training with Multiple GPUs","authors":"Zhenkun Cai, Qihui Zhou, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, G. Karypis","doi":"10.1145/3572848.3577528","DOIUrl":null,"url":null,"abstract":"Jointly utilizing multiple GPUs to train graph neural networks (GNNs) is crucial for handling large graphs and achieving high efficiency. However, we find that existing systems suffer from high communication costs and low GPU utilization due to improper data layout and training procedures. Thus, we propose a system dubbed Distributed Sampling and Pipelining (DSP) for multi-GPU GNN training. DSP adopts a tailored data layout to utilize the fast NVLink connections among the GPUs, which stores the graph topology and popular node features in GPU memory. For efficient graph sampling with multiple GPUs, we introduce a collective sampling primitive (CSP), which pushes the sampling tasks to data to reduce communication. We also design a producer-consumer-based pipeline, which allows tasks from different mini-batches to run congruently to improve GPU utilization. We compare DSP with state-of-the-art GNN training frameworks, and the results show that DSP consistently outperforms the baselines under different datasets, GNN models and GPU counts. The speedup of DSP can be up to 26x and is over 2x in most cases.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3572848.3577528","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Jointly utilizing multiple GPUs to train graph neural networks (GNNs) is crucial for handling large graphs and achieving high efficiency. However, we find that existing systems suffer from high communication costs and low GPU utilization due to improper data layout and training procedures. Thus, we propose a system dubbed Distributed Sampling and Pipelining (DSP) for multi-GPU GNN training. DSP adopts a tailored data layout that stores the graph topology and popular node features in GPU memory and exploits the fast NVLink connections among the GPUs. For efficient graph sampling with multiple GPUs, we introduce a collective sampling primitive (CSP), which pushes sampling tasks to the data to reduce communication. We also design a producer-consumer-based pipeline that allows tasks from different mini-batches to run concurrently to improve GPU utilization. We compare DSP with state-of-the-art GNN training frameworks, and the results show that DSP consistently outperforms the baselines across different datasets, GNN models, and GPU counts. The speedup of DSP can be up to 26x and is over 2x in most cases.
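The core idea behind CSP, pushing the sampling task to the GPU that owns the data rather than pulling remote adjacency lists to the requester, can be illustrated with a small, self-contained sketch. Everything below is an illustrative assumption rather than DSP's actual implementation: the hash partitioning, the function names, and the single-process simulation of the per-GPU all-to-all exchange.

```python
import random
from collections import defaultdict

NUM_PARTS = 2  # number of simulated GPUs (hypothetical setting)

def owner(node):
    # Hypothetical hash partitioning: each node's adjacency list
    # lives on exactly one "GPU".
    return node % NUM_PARTS

def collective_sample(adj_parts, frontier_per_part, fanout):
    """Sketch of one collective sampling step: route each sampling
    task to the partition that owns the adjacency list, sample there,
    and ship back only the (small) sampled result."""
    # 1) All-to-all: group each partition's requested nodes by owner.
    requests = defaultdict(lambda: defaultdict(list))
    for src_part, frontier in frontier_per_part.items():
        for node in frontier:
            requests[owner(node)][src_part].append(node)

    # 2) Each owner samples locally from the adjacency lists it holds,
    #    so the full neighbor lists never cross the interconnect.
    replies = defaultdict(dict)
    for own_part, per_src in requests.items():
        for src_part, nodes in per_src.items():
            for node in nodes:
                nbrs = adj_parts[own_part][node]
                k = min(fanout, len(nbrs))
                replies[src_part][node] = random.sample(nbrs, k)

    # 3) All-to-all back: each requester receives sampled neighbors only.
    return replies

# Tiny example graph, partitioned by owner().
adj_parts = {
    0: {0: [1, 2, 3], 2: [0, 3]},
    1: {1: [0, 2], 3: [0, 1, 2]},
}
frontier = {0: [0, 1], 1: [2, 3]}  # per-GPU seed nodes of a mini-batch
print(collective_sample(adj_parts, frontier, fanout=2))
```

In the same spirit, the producer-consumer pipeline described in the abstract would overlap such sampling calls for upcoming mini-batches with feature loading and training of earlier ones, so that no stage leaves the GPUs idle.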