Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

Proceedings of the 25th European MPI Users' Group Meeting. Pub Date: 2017-07-28. DOI: 10.1145/3236367.3236381

A. Awan, Ching-Hsiang Chu, H. Subramoni, D. Panda


Citations: 41

Abstract

Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has become important to design efficient communication schemes. This, coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK, poses additional design constraints because very large GPU buffers are communicated during the training phase. In this context, special-purpose libraries like NCCL have been proposed. In this paper, we propose a pipelined chain (ring) design for the MPI_Bcast collective operation along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/internode multi-GPU communication. We present an in-depth performance landscape for the proposed MPI_Bcast schemes along with a comparative analysis of NCCL Broadcast and NCCL-based MPI_Bcast. The proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement, compared to NCCL-based solutions, for intra- and internode broadcast latency, respectively. In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data-parallel training of the VGG network on 128 GPUs using Microsoft CNTK. The proposed solutions outperform the recently introduced NCCL2 library for small and medium message sizes and offer comparable or better performance for very large message sizes.
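
The abstract does not spell out the pipelined chain design, so the following is only a minimal illustrative sketch of the general idea, not the MVAPICH2-GDR implementation. It simulates, in plain Python with no MPI or GPU calls, a broadcast in which the root splits the message into chunks and every rank forwards each chunk to its successor as soon as it arrives. The function name `pipelined_chain_bcast`, its parameters, and the one-hop-per-step cost model are all assumptions made for illustration.

```python
from typing import List, Tuple


def pipelined_chain_bcast(message: bytes, num_ranks: int,
                          chunk_size: int) -> Tuple[List[bytes], int]:
    """Simulate a pipelined chain broadcast along ranks 0 -> 1 -> ... -> n-1.

    Hypothetical helper for illustration only. Rank 0 is the root.
    Returns the final buffer of every rank and the number of steps used,
    where in one step each rank may pass one chunk to its successor.
    """
    if num_ranks < 1 or chunk_size < 1:
        raise ValueError("num_ranks and chunk_size must be positive")

    # The root splits the message into fixed-size chunks.
    chunks = [message[i:i + chunk_size]
              for i in range(0, len(message), chunk_size)]

    # received[r] lists the chunks rank r holds; the root starts with all.
    received = [list(chunks)] + [[] for _ in range(num_ranks - 1)]

    steps = 0
    while any(len(held) < len(chunks) for held in received):
        # Walk from the tail to the head so that a chunk advances at most
        # one hop per step (a rank cannot forward what it gets this step).
        for rank in range(num_ranks - 1, 0, -1):
            have = len(received[rank])
            if have < len(received[rank - 1]):
                received[rank].append(received[rank - 1][have])
        steps += 1

    return [b"".join(held) for held in received], steps


if __name__ == "__main__":
    buffers, steps = pipelined_chain_bcast(b"abcdefgh", num_ranks=4,
                                           chunk_size=1)
    print(buffers, steps)
```

Under this simplified model, a message split into C chunks reaches the last of n ranks after C + n - 2 steps, each moving one chunk per link. Sending the whole message as a single chunk takes only n - 1 steps, but each step moves the full message, which is why pipelining tends to pay off for the very large messages the abstract mentions.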
