Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning
A. Awan, Khaled Hamidouche, Akshay Venkatesh, D. Panda
Proceedings of the 23rd European MPI Users' Group Meeting, 2016. DOI: 10.1145/2966884.2966912
Abstract
Emerging paradigms like High Performance Data Analytics (HPDA) and Deep Learning (DL) pose at least two new design challenges for existing MPI runtimes. First, these paradigms require efficient support for communicating unusually large messages across processes. Second, the communication buffers used by HPDA applications and DL frameworks generally reside in GPU memory. In this context, we observe that conventional MPI runtimes have been optimized over decades to achieve the lowest possible communication latency for relatively small message sizes (up to 1 megabyte), and only for CPU memory buffers. With the advent of CUDA-Aware MPI runtimes, considerable research has been conducted to improve the performance of GPU-buffer-based communication. However, little work in the current state of the art deals with very large message communication from GPU buffers. In this paper, we investigate these new challenges by analyzing the performance bottlenecks in existing CUDA-Aware MPI runtimes such as MVAPICH2-GDR, and propose hierarchical collective designs that improve the communication latency of the MPI_Bcast primitive by exploiting a new communication library called NCCL. To the best of our knowledge, this is the first work to address these new requirements, where GPU buffers are used for communication with message sizes surpassing hundreds of megabytes. We highlight the design challenges for our work along with the details of design and implementation. In addition, we provide a comprehensive performance evaluation using a micro-benchmark and a CUDA-Aware adaptation of the Microsoft CNTK DL framework. We report up to 47% improvement in training time for CNTK using the proposed hierarchical MPI_Bcast design.
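To make the hierarchical idea concrete, below is a minimal sketch written at the application level, not the paper's actual implementation inside the MVAPICH2-GDR runtime: a CUDA-Aware MPI_Bcast moves the GPU buffer across node leaders, and an NCCL broadcast fans it out to the remaining GPUs within each node. The function name hierarchical_bcast and the communicator arguments (node_comm, leader_comm, nccl_comm) are hypothetical, and communicator setup (e.g., MPI_Comm_split_type and ncclCommInitRank) is omitted.

```c
/*
 * Hedged sketch of a two-level (hierarchical) broadcast for a GPU buffer.
 * This only illustrates the general structure suggested by the abstract;
 * it is not the authors' MVAPICH2-GDR-internal design. All helper names
 * and communicator arguments are assumptions for illustration.
 */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

/* Broadcast 'count' bytes of device memory at d_buf from the global root
 * (assumed to be local rank 0 on its node and rank 0 in leader_comm). */
static void hierarchical_bcast(void *d_buf, size_t count,
                               MPI_Comm node_comm,   /* ranks on this node   */
                               MPI_Comm leader_comm, /* one rank per node    */
                               ncclComm_t nccl_comm, /* intra-node NCCL comm */
                               cudaStream_t stream)
{
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Stage 1: inter-node. Node leaders (local rank 0) exchange the data
     * with a CUDA-Aware MPI_Bcast operating directly on the GPU buffer.
     * (For messages of a few hundred MB the int count still fits.) */
    if (node_rank == 0) {
        MPI_Bcast(d_buf, (int)count, MPI_BYTE, 0, leader_comm);
    }

    /* Stage 2: intra-node. Each leader broadcasts to the other GPUs on its
     * node through NCCL; root 0 is assumed to be the leader's NCCL rank.
     * Non-leader ranks simply enqueue the NCCL broadcast on their stream. */
    ncclBcast(d_buf, count, ncclChar, 0, nccl_comm, stream);
    cudaStreamSynchronize(stream);
}
```

Note that the paper realizes its hierarchical designs inside the MPI runtime itself (MVAPICH2-GDR), so an application would simply call MPI_Bcast on a GPU buffer; the sketch above only makes the two-level structure explicit.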