Automatic Graph Partitioning for Very Large-scale Deep Learning

Masahiro Tanaka, K. Taura, T. Hanawa, Kentaro Torisawa
{"title":"Automatic Graph Partitioning for Very Large-scale Deep Learning","authors":"Masahiro Tanaka, K. Taura, T. Hanawa, Kentaro Torisawa","doi":"10.1109/IPDPS49936.2021.00109","DOIUrl":null,"url":null,"abstract":"This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism. In recent deep learning research, as exemplified by T5 and GPT-3, the size of neural network models continues to grow. Since such models do not fit into the memory of accelerator devices, they need to be partitioned by model parallelism techniques. Moreover, to accelerate training for huge training data, we need a combination of model and data parallelisms, i.e., hybrid parallelism. Given a model description for PyTorch without any specification for model parallelism, RaNNC automatically partitions the model into a set of subcomponents so that (1) each subcomponent fits a device memory and (2) a high training throughput for pipeline parallelism is achieved by balancing the computation times of the subcomponents. Since the search space for partitioning models can be extremely large, RaNNC partitions a model through the following three phases. First, it identifies atomic subcomponents using simple heuristic rules. Next it groups them into coarser-grained blocks while balancing their computation times. Finally, it uses a novel dynamic programming-based algorithm to efficiently search for combinations of blocks to determine the final partitions. In our experiments, we compared RaNNC with two popular frameworks, Megatron-LM (hybrid parallelism) and GPipe (originally proposed for model parallelism, but a version allowing hybrid parallelism also exists), for training models with increasingly greater numbers of parameters. In the pre-training of enlarged BERT models, RaNNC successfully trained models five times larger than those Megatron-LM could, and RaNNC’s training throughputs were comparable to Megatron-LM’s when pre-training the same models. RaNNC also achieved better training throughputs than GPipe on both the enlarged BERT model pre-training (GPipe with hybrid parallelism) and the enlarged ResNet models (GPipe with model parallelism) in all of the settings we tried. These results are remarkable, since RaNNC automatically partitions models without any modification to their descriptions; Megatron-LM and GPipe require users to manually rewrite the models’ descriptions.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS49936.2021.00109","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism. In recent deep learning research, as exemplified by T5 and GPT-3, the size of neural network models continues to grow. Since such models do not fit into the memory of accelerator devices, they need to be partitioned by model parallelism techniques. Moreover, to accelerate training on huge training data, we need a combination of model and data parallelism, i.e., hybrid parallelism. Given a PyTorch model description without any specification of model parallelism, RaNNC automatically partitions the model into a set of subcomponents so that (1) each subcomponent fits in a device's memory and (2) high training throughput for pipeline parallelism is achieved by balancing the computation times of the subcomponents. Since the search space for partitioning models can be extremely large, RaNNC partitions a model in three phases. First, it identifies atomic subcomponents using simple heuristic rules. Next, it groups them into coarser-grained blocks while balancing their computation times. Finally, it uses a novel dynamic-programming-based algorithm to efficiently search for combinations of blocks that determine the final partitions. In our experiments, we compared RaNNC with two popular frameworks, Megatron-LM (hybrid parallelism) and GPipe (originally proposed for model parallelism, though a version allowing hybrid parallelism also exists), for training models with increasingly large numbers of parameters. In the pre-training of enlarged BERT models, RaNNC successfully trained models five times larger than those Megatron-LM could, and RaNNC's training throughput was comparable to Megatron-LM's when pre-training the same models. RaNNC also achieved better training throughput than GPipe on both the enlarged BERT model pre-training (GPipe with hybrid parallelism) and the enlarged ResNet models (GPipe with model parallelism) in all of the settings we tried. These results are remarkable, since RaNNC automatically partitions models without any modification to their descriptions, whereas Megatron-LM and GPipe require users to manually rewrite the models' descriptions.
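To make the final phase of the partitioning procedure more concrete, the following is a minimal sketch of a dynamic-programming search over contiguous coarse-grained blocks, assuming per-block compute times and memory footprints obtained from profiling. The function name partition_blocks, the cost model (minimize the slowest pipeline stage subject to a per-device memory limit), and the toy numbers are illustrative assumptions; this is not RaNNC's actual algorithm or code.

```python
# Illustrative sketch (not RaNNC's implementation): dynamic programming over
# contiguous blocks to split a model into pipeline stages. Each block has an
# estimated compute time and memory footprint, assumed to come from profiling.

from functools import lru_cache
from typing import List, Tuple

def partition_blocks(times: List[float], mems: List[float],
                     num_stages: int, mem_limit: float) -> Tuple[float, List[int]]:
    """Split blocks 0..n-1 into `num_stages` contiguous groups so that the
    slowest group's compute time (the pipeline bottleneck) is minimized and
    every group fits within `mem_limit`. Returns (bottleneck, cut indices)."""
    n = len(times)
    prefix_t = [0.0] * (n + 1)
    prefix_m = [0.0] * (n + 1)
    for i in range(n):
        prefix_t[i + 1] = prefix_t[i] + times[i]
        prefix_m[i + 1] = prefix_m[i] + mems[i]

    INF = float("inf")

    @lru_cache(maxsize=None)
    def best(start: int, stages: int) -> Tuple[float, Tuple[int, ...]]:
        # Best achievable bottleneck when blocks[start:] form `stages` groups.
        if stages == 1:
            t = prefix_t[n] - prefix_t[start]
            m = prefix_m[n] - prefix_m[start]
            return (t, ()) if m <= mem_limit else (INF, ())
        result, cuts = INF, ()
        # First group is blocks[start:end]; leave at least one block per later stage.
        for end in range(start + 1, n - stages + 2):
            t = prefix_t[end] - prefix_t[start]
            m = prefix_m[end] - prefix_m[start]
            if m > mem_limit:
                break  # group memory only grows with `end`; no larger group fits
            rest, rest_cuts = best(end, stages - 1)
            bottleneck = max(t, rest)
            if bottleneck < result:
                result, cuts = bottleneck, (end,) + rest_cuts
        return result, cuts

    bottleneck, cuts = best(0, num_stages)
    return bottleneck, list(cuts)

if __name__ == "__main__":
    # Toy example: 8 coarse-grained blocks split across 3 devices.
    times = [4.0, 2.0, 3.0, 5.0, 1.0, 2.0, 4.0, 3.0]   # ms per block (profiled)
    mems  = [2.0, 1.0, 1.5, 3.0, 0.5, 1.0, 2.0, 1.5]   # GB per block
    bottleneck, cuts = partition_blocks(times, mems, num_stages=3, mem_limit=6.0)
    print("stage boundaries after blocks:", cuts, "bottleneck time:", bottleneck)
```

The point mirrored in this sketch is that restricting the search to contiguous groups keeps the dynamic program polynomial in the number of blocks, which is why first coarsening the atomic subcomponents into blocks matters when the original search space is extremely large.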