Communication Optimization for Distributed Execution of Graph Neural Networks

Süreyya Emre Kurt, Jinghua Yan, Aravind Sukumaran-Rajam, Prashant Pandey, P. Sadayappan
{"title":"Communication Optimization for Distributed Execution of Graph Neural Networks","authors":"Süreyya Emre Kurt, Jinghua Yan, Aravind Sukumaran-Rajam, Prashant Pandey, P. Sadayappan","doi":"10.1109/IPDPS54959.2023.00058","DOIUrl":null,"url":null,"abstract":"Graph Neural Networks (GNNs) have emerged as a very powerful and popular machine learning model for numerous application domains. Each stage of a GNN requires an aggregation (sparse matrix-matrix multiplication) and a linear operation (dense matrix-matrix multiplication). Numerous efforts have addressed the development of distributed implementations for GNNs. Although efficient algorithms for distributed matrix multiplication are well known, the challenge here is the collective optimization of sequences of distributed matrix-matrix multiplications required for GNN, where many degrees of freedom also exist in the ordering of the component matrix-multiplication operations.This paper develops a new approach to distributed GNN, ReDistribution of Matrices (RDM), centered around communication-free distributed matrix-multiplication enabled by matrix redistribution between GNN stages. While the approach is applicable to the numerous algorithmic variants of GNN, the experimental evaluation focuses on GCN (Graph Convolutional Network), including both full-batch training as well as sampling-based training using GraphSAINT. Experimental evaluation with 2-layer and 3-layer GCN, using 128 or 256 hidden features, across eight sparse datasets, on a multi-GPU system with 8 GPUs shows that RDM attains a geometric mean speedup between 2× and 3.7× over two state-of-the-art multi-GPU GCN implementations, CAGNET and DGCL.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS54959.2023.00058","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Graph Neural Networks (GNNs) have emerged as a powerful and popular machine learning model for numerous application domains. Each stage of a GNN requires an aggregation (sparse matrix-matrix multiplication) and a linear operation (dense matrix-matrix multiplication). Numerous efforts have addressed the development of distributed implementations for GNNs. Although efficient algorithms for distributed matrix multiplication are well known, the challenge here is the collective optimization of the sequence of distributed matrix-matrix multiplications required for a GNN, where many degrees of freedom also exist in the ordering of the component matrix-multiplication operations. This paper develops a new approach to distributed GNN execution, ReDistribution of Matrices (RDM), centered around communication-free distributed matrix multiplication enabled by matrix redistribution between GNN stages. While the approach is applicable to the numerous algorithmic variants of GNNs, the experimental evaluation focuses on GCN (Graph Convolutional Network), including both full-batch training and sampling-based training using GraphSAINT. Experimental evaluation with 2-layer and 3-layer GCNs, using 128 or 256 hidden features, across eight sparse datasets on an 8-GPU system shows that RDM attains a geometric-mean speedup between 2× and 3.7× over two state-of-the-art multi-GPU GCN implementations, CAGNET and DGCL.
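
To make the per-stage structure concrete, the following is a minimal single-process sketch (under illustrative assumptions, not the paper's RDM implementation) of one GCN layer as a sparse aggregation (SpMM) followed by a dense linear transform (GEMM). The identifiers `gcn_layer`, `adj`, `feats`, and `weight` are placeholder names introduced here for illustration.

```python
# Illustrative sketch of one GCN stage: H' = ReLU(A @ H @ W),
# i.e. a sparse aggregation (SpMM) followed by a dense linear op (GEMM).
import numpy as np
import scipy.sparse as sp

def gcn_layer(adj: sp.csr_matrix, feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """adj: (n, n) normalized adjacency A (sparse);
    feats: (n, f_in) node features H (dense);
    weight: (f_in, f_out) layer weights W (dense)."""
    aggregated = adj @ feats            # aggregation: sparse-dense product (SpMM)
    transformed = aggregated @ weight   # linear operation: dense-dense product (GEMM)
    return np.maximum(transformed, 0)   # ReLU non-linearity

# Tiny usage example with random data.
n, f_in, f_out = 6, 4, 3
adj = sp.random(n, n, density=0.3, format="csr")
feats = np.random.rand(n, f_in)
weight = np.random.rand(f_in, f_out)
out = gcn_layer(adj, feats, weight)     # shape (6, 3)
```

The same stage could equally be evaluated as `adj @ (feats @ weight)`; choosing between such orderings of the component matrix multiplications is one of the degrees of freedom the abstract refers to.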