{"title":"Accelerating TensorFlow with Adaptive RDMA-Based gRPC","authors":"Rajarshi Biswas, Xiaoyi Lu, D. Panda","doi":"10.1109/HiPC.2018.00010","DOIUrl":null,"url":null,"abstract":"Google's TensorFlow is one of the most popular Deep Learning frameworks nowadays. Distributed TensorFlow supports various channels to efficiently transfer tensors, such as gRPC over TCP/IP, gRPC+Verbs, and gRPC+MPI. At present, the community lacks a thorough characterization of distributed TensorFlow communication channels. This is critical because high-performance Deep Learning with TensorFlow needs an efficient communication runtime. Thus, we conduct a thorough analysis of the communication characteristics of distributed TensorFlow. Our studies show that none of the existing channels in TensorFlow can support adaptive and efficient communication for Deep Learning workloads with different message sizes. Moreover, the community needs to maintain these different channels while the users are also expected to tune these channels to get the desired performance. Therefore, this paper proposes a unified approach to have a single gRPC runtime (i.e., AR-gRPC) in TensorFlow with Adaptive and efficient RDMA protocols. In AR-gRPC, we propose designs such as hybrid communication protocols, message pipelining and coalescing, zero-copy transmission etc. to make our runtime be adaptive to different message sizes for Deep Learning workloads. Our performance evaluations show that AR-gRPC can significantly speedup gRPC performance by up to 4.1x and 2.3x compared to the default gRPC design on IPoIB and another RDMA-based gRPC design in the community. Comet supercomputer shows that AR-gRPC design can reduce the Point-to-Point latency by up to 75% compared to the default gRPC design. 
By integrating our AR-gRPC with TensorFlow, we can achieve up to 3x distributed training speedup over default gRPC-IPoIB based TensorFlow.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2018.00010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 20
Abstract
Google's TensorFlow is one of the most popular Deep Learning frameworks today. Distributed TensorFlow supports several channels for transferring tensors efficiently, such as gRPC over TCP/IP, gRPC+Verbs, and gRPC+MPI. At present, the community lacks a thorough characterization of distributed TensorFlow's communication channels. This matters because high-performance Deep Learning with TensorFlow needs an efficient communication runtime. We therefore conduct a thorough analysis of the communication characteristics of distributed TensorFlow. Our studies show that none of the existing channels in TensorFlow supports adaptive and efficient communication for Deep Learning workloads across different message sizes. Moreover, the community must maintain these separate channels, while users are expected to tune them to obtain the desired performance. This paper therefore proposes a unified approach: a single gRPC runtime (AR-gRPC) in TensorFlow built on adaptive and efficient RDMA protocols. In AR-gRPC, we propose designs such as hybrid communication protocols, message pipelining and coalescing, and zero-copy transmission to make the runtime adaptive to the different message sizes of Deep Learning workloads. Our performance evaluations show that AR-gRPC accelerates gRPC by up to 4.1x over the default gRPC design on IPoIB and by up to 2.3x over another RDMA-based gRPC design from the community. Experiments on the Comet supercomputer show that the AR-gRPC design reduces point-to-point latency by up to 75% compared to the default gRPC design. By integrating AR-gRPC with TensorFlow, we achieve up to 3x distributed training speedup over default gRPC-IPoIB-based TensorFlow.
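The hybrid, message-size-aware protocol selection the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold, chunk size, and all function names (`choose_protocol`, `plan_transfer`) are hypothetical, standing in for an RDMA runtime that sends small messages eagerly through pre-registered buffers and pipelines large messages as zero-copy chunks.

```python
# Hypothetical sketch of an adaptive, message-size-aware RDMA transfer policy.
# Thresholds and names are illustrative assumptions, not taken from AR-gRPC.

EAGER_THRESHOLD = 8 * 1024    # small payloads: copy into a pre-posted buffer
CHUNK_SIZE = 1 * 1024 * 1024  # large payloads: pipeline fixed-size zero-copy chunks


def choose_protocol(size: int) -> str:
    """Pick a transfer protocol based on payload size in bytes."""
    return "eager-copy" if size <= EAGER_THRESHOLD else "zero-copy-rendezvous"


def plan_transfer(size: int):
    """Return (protocol, chunk sizes) for a payload of `size` bytes.

    Small messages go eagerly in one piece; large messages are split into
    fixed-size chunks so buffer registration and transmission can overlap
    (pipelining), avoiding a copy of the tensor data.
    """
    proto = choose_protocol(size)
    if proto == "eager-copy":
        return proto, [size]
    chunks = [CHUNK_SIZE] * (size // CHUNK_SIZE)
    if size % CHUNK_SIZE:
        chunks.append(size % CHUNK_SIZE)
    return proto, chunks
```

The design intuition is that for small tensors the extra handshake of a rendezvous protocol dominates, so an eager copy is cheaper, while for large tensors the copy itself dominates, so zero-copy chunked transfers win.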