Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning
A. Awan, Khaled Hamidouche, Akshay Venkatesh, D. Panda
Proceedings of the 23rd European MPI Users' Group Meeting, 2016. DOI: 10.1145/2966884.2966912
Abstract
Emerging paradigms like High Performance Data Analytics (HPDA) and Deep Learning (DL) pose at least two new design challenges for existing MPI runtimes. First, these paradigms require efficient support for communicating unusually large messages across processes. Second, the communication buffers used by HPDA applications and DL frameworks generally reside in GPU memory. In this context, we observe that conventional MPI runtimes have been optimized over decades to achieve the lowest possible communication latency for relatively small message sizes (up to 1 megabyte), and only for CPU memory buffers. With the advent of CUDA-Aware MPI runtimes, considerable research has been conducted to improve the performance of GPU-buffer-based communication. However, little work in the current state of the art deals with very large message communication from GPU buffers. In this paper, we investigate these new challenges by analyzing the performance bottlenecks in existing CUDA-Aware MPI runtimes such as MVAPICH2-GDR, and propose hierarchical collective designs that improve the communication latency of the MPI_Bcast primitive by exploiting a new communication library called NCCL. To the best of our knowledge, this is the first work to address these new requirements, where GPU buffers are used for communication with message sizes surpassing hundreds of megabytes. We highlight the design challenges for our work along with the details of design and implementation. In addition, we provide a comprehensive performance evaluation using a micro-benchmark and a CUDA-Aware adaptation of the Microsoft CNTK DL framework. We report up to 47% improvement in training time for CNTK using the proposed hierarchical MPI_Bcast design.
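To make the hierarchical idea concrete, below is a minimal sketch written at the application level, not the paper's actual implementation inside the MVAPICH2-GDR runtime: a CUDA-Aware MPI_Bcast moves the GPU buffer across node leaders, and an NCCL broadcast fans it out to the remaining GPUs within each node. The function name hierarchical_bcast and the communicator arguments (node_comm, leader_comm, nccl_comm) are hypothetical, and communicator setup (e.g., MPI_Comm_split_type and ncclCommInitRank) is omitted.

```c
/*
 * Hedged sketch of a two-level (hierarchical) broadcast for a GPU buffer.
 * This only illustrates the general structure suggested by the abstract;
 * it is not the authors' MVAPICH2-GDR-internal design. All helper names
 * and communicator arguments are assumptions for illustration.
 */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

/* Broadcast 'count' bytes of device memory at d_buf from the global root
 * (assumed to be local rank 0 on its node and rank 0 in leader_comm). */
static void hierarchical_bcast(void *d_buf, size_t count,
                               MPI_Comm node_comm,   /* ranks on this node   */
                               MPI_Comm leader_comm, /* one rank per node    */
                               ncclComm_t nccl_comm, /* intra-node NCCL comm */
                               cudaStream_t stream)
{
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Stage 1: inter-node. Node leaders (local rank 0) exchange the data
     * with a CUDA-Aware MPI_Bcast operating directly on the GPU buffer.
     * (For messages of a few hundred MB the int count still fits.) */
    if (node_rank == 0) {
        MPI_Bcast(d_buf, (int)count, MPI_BYTE, 0, leader_comm);
    }

    /* Stage 2: intra-node. Each leader broadcasts to the other GPUs on its
     * node through NCCL; root 0 is assumed to be the leader's NCCL rank.
     * Non-leader ranks simply enqueue the NCCL broadcast on their stream. */
    ncclBcast(d_buf, count, ncclChar, 0, nccl_comm, stream);
    cudaStreamSynchronize(stream);
}
```

Note that the paper realizes its hierarchical designs inside the MPI runtime itself (MVAPICH2-GDR), so an application would simply call MPI_Bcast on a GPU buffer; the sketch above only makes the two-level structure explicit.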