Analysis of model parallelism for distributed neural networks
Adrián Castelló, M. F. Dolz, E. S. Quintana‐Ortí, J. Duato
Proceedings of the 26th European MPI Users' Group Meeting (EuroMPI 2019), September 11, 2019. DOI: 10.1145/3343211.3343218
We analyze the performance of model parallelism applied to the training of deep neural networks on clusters. For this study, we develop a parameterized analytical performance model that captures the main computational and communication stages in distributed model-parallel training. We then leverage this model to assess the performance impact on four representative convolutional neural networks (CNNs) of varying the node throughput (in operations per second), the memory bandwidth, the number of nodes in the cluster, the bandwidth of the network links, and algorithmic parameters such as the batch size. As a second contribution, we discuss the need for specialized variants of the MPI_Allgather and MPI_Allreduce collective communication primitives in which the number of "contributing" processes differs from the number of processes that receive a copy, or a part, of the result during training. Furthermore, we analyze the effect that the actual implementation of the algorithms underlying these collective primitives exerts on the performance of the distributed model-parallel realization of the selected CNNs.
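To make the second contribution concrete: standard MPI_Allgather assumes that every process in the communicator contributes an equal-sized block, whereas the model-parallel training scenario described above has fewer contributors than receivers. The following C/MPI sketch is not taken from the paper; it is a hypothetical illustration (NCONTRIB, BLOCK, and the buffer layout are assumptions introduced here) of how such an exchange can be emulated today with MPI_Allgatherv by giving the non-contributing ranks a zero-length contribution.

/* Hypothetical sketch: an "allgather" in which only the first ncontrib ranks
 * contribute data, yet every rank receives the full gathered result.
 * Emulated with MPI_Allgatherv using zero-length contributions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NCONTRIB 2   /* illustrative: number of "contributing" processes   */
#define BLOCK    4   /* illustrative: elements contributed by each of them */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ncontrib = (NCONTRIB < size) ? NCONTRIB : size;

    /* Contributing ranks send BLOCK doubles; the remaining ranks send nothing. */
    int sendcount = (rank < ncontrib) ? BLOCK : 0;
    double sendbuf[BLOCK];
    for (int i = 0; i < BLOCK; i++)
        sendbuf[i] = rank + 0.1 * i;

    /* Per-rank receive counts and displacements: only the first ncontrib
     * slots of the result are non-empty. */
    int *recvcounts = malloc(size * sizeof(int));
    int *displs     = malloc(size * sizeof(int));
    for (int p = 0; p < size; p++) {
        recvcounts[p] = (p < ncontrib) ? BLOCK : 0;
        displs[p]     = (p < ncontrib) ? p * BLOCK : 0; /* displacement unused when count is 0 */
    }

    double *recvbuf = malloc(ncontrib * BLOCK * sizeof(double));
    MPI_Allgatherv(sendbuf, sendcount, MPI_DOUBLE,
                   recvbuf, recvcounts, displs, MPI_DOUBLE,
                   MPI_COMM_WORLD);

    if (rank == size - 1)  /* even a non-contributing rank holds the full result */
        printf("rank %d received %d elements\n", rank, ncontrib * BLOCK);

    free(recvcounts); free(displs); free(recvbuf);
    MPI_Finalize();
    return 0;
}

This emulation still issues a collective over all P processes even though only a subset contributes, which is one way to read the abstract's motivation for specialized variants of these primitives: a collective aware of the reduced contributor set could, in principle, avoid the associated overhead.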