{"title":"Efficient Asynchronous GCN Training on a GPU Cluster","authors":"Y. Zhang, D. Goswami","doi":"10.1109/ICPADS53394.2021.00086","DOIUrl":null,"url":null,"abstract":"Research on Graph Convolutional Networks (GCNs) has increasingly gained popularity in recent years due to the powerful representational capacity of graphs. A common assumption in traditional synchronous parallel training of GCNs using multiple GPUs is that load is perfectly balanced. However, this assumption may not hold in a real-world scenario where there can be imbalances in workloads among GPUs for various reasons. In a synchronous parallel implementation, a straggler in the system can limit the overall speed up of parallel training. To address these performance issues, this research investigates approaches for asynchronous decentralized parallel training of GCNs on a GPU cluster. The techniques investigated are based on graph clustering and the Gossip protocol. The research specifically adapts the approach of Cluster GCN, which uses graph partitioning for SGD based training, and combines with a gossip algorithm specifically designed for a GPU cluster to periodically exchange gradients among randomly chosen partners (GPUs). In addition, it incorporates a work pool mechanism for load balancing among GPUs. The gossip algorithm is proven to be deadlock free. The implementation is performed on a deep learning cluster with 8 Tesla V100 GPUs per compute node, and PyTorch and DGL as the software platforms. Experiments are conducted on different benchmark datasets. 
The results demonstrate superior performance with similar accuracy scores, as compared to traditional synchronous training which uses “all reduce” to synchronously accumulate parallel training results.","PeriodicalId":309508,"journal":{"name":"2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPADS53394.2021.00086","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Research on Graph Convolutional Networks (GCNs) has gained popularity in recent years due to the powerful representational capacity of graphs. Traditional synchronous parallel training of GCNs on multiple GPUs commonly assumes that the load is perfectly balanced. However, this assumption may not hold in real-world scenarios, where workloads can become imbalanced across GPUs for various reasons. In a synchronous parallel implementation, a single straggler can limit the overall speedup of parallel training. To address these performance issues, this research investigates approaches for asynchronous decentralized parallel training of GCNs on a GPU cluster. The techniques investigated are based on graph clustering and the Gossip protocol. Specifically, the research adapts the approach of Cluster-GCN, which uses graph partitioning for SGD-based training, and combines it with a gossip algorithm designed for a GPU cluster that periodically exchanges gradients among randomly chosen partner GPUs. In addition, it incorporates a work-pool mechanism for load balancing among GPUs. The gossip algorithm is proven to be deadlock-free. The implementation runs on a deep learning cluster with 8 Tesla V100 GPUs per compute node, using PyTorch and DGL as the software platforms. Experiments conducted on several benchmark datasets demonstrate superior performance with similar accuracy, compared to traditional synchronous training that uses all-reduce to synchronously accumulate parallel training results.
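The core idea of gossip-based exchange, as described in the abstract, can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation: it assumes simple pairwise partner selection and gradient averaging, and the names (`gossip_round`) and toy data are illustrative only.

```python
import random

def gossip_round(grads, rng):
    # One gossip round: shuffle worker ids, pair neighbours, and have
    # each pair average its gradients -- a simplified stand-in for the
    # paper's periodic exchange among randomly chosen partner GPUs.
    ids = list(range(len(grads)))
    rng.shuffle(ids)
    for a, b in zip(ids[0::2], ids[1::2]):
        avg = [(x + y) / 2.0 for x, y in zip(grads[a], grads[b])]
        grads[a] = avg
        grads[b] = list(avg)

# Toy run: 4 workers holding 1-component "gradients".
rng = random.Random(0)
grads = [[1.0], [3.0], [5.0], [7.0]]
for _ in range(30):
    gossip_round(grads, rng)
# Pairwise averaging preserves the global sum, so every worker
# drifts toward the global mean (4.0 here) without any all-reduce.
```

Because each exchange only involves two partners, no global barrier is needed, which is what lets a decentralized scheme tolerate stragglers better than synchronous all-reduce.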
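The abstract also mentions a work-pool mechanism for load balancing. A plausible reading is that graph partitions (produced by Cluster-GCN-style partitioning) are placed in a shared pool and idle workers pull the next one, so faster workers naturally process more partitions. The following self-contained sketch models that idea with a thread-safe queue; the partition counts, worker counts, and names are hypothetical, not taken from the paper.

```python
import queue
import threading

# Shared pool of graph-partition ids; 8 partitions and 2 workers are
# illustrative numbers only.
work = queue.Queue()
for part_id in range(8):
    work.put(part_id)

processed = {w: [] for w in range(2)}
lock = threading.Lock()

def worker(wid):
    # Each worker pulls partitions until the pool is empty, so no
    # worker idles while work remains -- the load-balancing effect.
    while True:
        try:
            part = work.get_nowait()
        except queue.Empty:
            return
        # A real implementation would run a training step on `part` here.
        with lock:
            processed[wid].append(part)

threads = [threading.Thread(target=worker, args=(w,)) for w in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every partition is processed exactly once, regardless of worker speed.
```

The queue's atomic `get_nowait` guarantees each partition is claimed by exactly one worker, which is the property a work pool needs for correctness under imbalanced workloads.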