Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
A. Awan, J. Bédorf, Ching-Hsiang Chu, H. Subramoni, D. Panda
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
DOI: 10.1109/CCGRID.2019.00064
Citations: 36
Abstract
The current wave of advances in Deep Learning (DL) has been triggered by the availability of large-scale datasets, efficient CPU and GPU hardware, and the development of software frameworks like TensorFlow (TF). However, little exists in the literature that addresses TensorFlow's distributed training capabilities. In this paper, we provide an in-depth performance characterization and design analysis for distributed TensorFlow. We present three key insights: 1) the Horovod designs achieve better performance than the official gRPC-based approaches, 2) the performance of the Horovod design is heavily influenced by the time spent in gradient aggregation, which uses the Allreduce primitive, and 3) the performance of the existing Horovod-MPI implementation is significantly worse than that of Horovod-NCCL. To address this limitation in Horovod-MPI, we propose a novel and efficient CUDA-Aware MPI Allreduce design that 1) exploits CUDA kernels to perform large reductions on the GPU, 2) uses a combination of bandwidth-optimal and latency-optimal algorithms, and 3) maintains a pointer cache to avoid CUDA-driver query overheads in the critical path. The proposed designs deliver 5×, 17×, and 29% better performance than NCCL2 for small, medium, and large messages, respectively. Our designs enable Horovod-MPI to beat the state-of-the-art Horovod-NCCL2 by 3% and achieve 90% scaling efficiency for ResNet-50 training on 64 Pascal GPUs.
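To illustrate the operation the abstract centers on, the sketch below shows a generic CUDA-aware MPI Allreduce over a GPU-resident gradient buffer, the primitive that Horovod-MPI relies on for gradient aggregation. This is a minimal illustration under stated assumptions, not the authors' optimized design: the buffer size, the zero-initialized "gradients", and the in-place averaging note are all assumptions for the example; only the standard MPI and CUDA runtime calls are taken as given.

```c
/* Minimal sketch (not the paper's optimized design): an Allreduce over a
 * GPU-resident buffer using a CUDA-aware MPI library, which accepts device
 * pointers directly. Buffer size and contents are illustrative assumptions. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const size_t count = 1 << 20;                   /* e.g., 1M float gradients */
    float *d_grads = NULL;
    cudaMalloc((void **)&d_grads, count * sizeof(float));
    cudaMemset(d_grads, 0, count * sizeof(float));  /* stand-in for local gradients */

    /* With a CUDA-aware MPI, the device pointer is passed directly;
     * how the library stages or reduces the data is up to its design. */
    MPI_Allreduce(MPI_IN_PLACE, d_grads, (int)count,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    /* Each rank now holds the summed gradients; dividing by 'size'
     * (e.g., in a small CUDA kernel) would give the averaged gradients. */

    if (rank == 0)
        printf("Allreduce over %zu floats across %d ranks done\n", count, size);

    cudaFree(d_grads);
    MPI_Finalize();
    return 0;
}
```

In a non-CUDA-aware MPI, the application would first have to copy the buffer to host memory before calling MPI_Allreduce; avoiding that copy, and performing the reduction on the GPU, is the gap the paper's design targets.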