Analysis of model parallelism for distributed neural networks
Adrián Castelló, M. F. Dolz, E. S. Quintana‐Ortí, J. Duato
Proceedings of the 26th European MPI Users' Group Meeting (EuroMPI 2019), September 11, 2019. DOI: 10.1145/3343211.3343218
We analyze the performance of model parallelism applied to the training of deep neural networks on clusters. For this study, we develop a parameterized analytical performance model that captures the main computational and communication stages in distributed model-parallel training. We then leverage this model to assess the performance impact on four representative convolutional neural networks (CNNs) of varying the node throughput (in operations per second), the memory bandwidth, the number of nodes in the cluster, the bandwidth of the network links, and algorithmic parameters such as the batch size. As a second contribution, we discuss the need for specialized variants of the MPI_Allgather and MPI_Allreduce collective communication primitives in which the number of "contributing" processes differs from the number of processes that receive a copy, or a part, of the result during training. Furthermore, we analyze the effect that the actual implementation of the algorithms underlying these collective primitives exerts on the performance of the distributed model-parallel realization of the selected CNNs.
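To make the second contribution concrete: standard MPI_Allgather assumes that every process in the communicator contributes an equal-sized block, whereas the model-parallel training scenario described above has fewer contributors than receivers. The following C/MPI sketch is not taken from the paper; it is a hypothetical illustration (NCONTRIB, BLOCK, and the buffer layout are assumptions introduced here) of how such an exchange can be emulated today with MPI_Allgatherv by giving the non-contributing ranks a zero-length contribution.

/* Hypothetical sketch: an "allgather" in which only the first ncontrib ranks
 * contribute data, yet every rank receives the full gathered result.
 * Emulated with MPI_Allgatherv using zero-length contributions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NCONTRIB 2   /* illustrative: number of "contributing" processes   */
#define BLOCK    4   /* illustrative: elements contributed by each of them */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ncontrib = (NCONTRIB < size) ? NCONTRIB : size;

    /* Contributing ranks send BLOCK doubles; the remaining ranks send nothing. */
    int sendcount = (rank < ncontrib) ? BLOCK : 0;
    double sendbuf[BLOCK];
    for (int i = 0; i < BLOCK; i++)
        sendbuf[i] = rank + 0.1 * i;

    /* Per-rank receive counts and displacements: only the first ncontrib
     * slots of the result are non-empty. */
    int *recvcounts = malloc(size * sizeof(int));
    int *displs     = malloc(size * sizeof(int));
    for (int p = 0; p < size; p++) {
        recvcounts[p] = (p < ncontrib) ? BLOCK : 0;
        displs[p]     = (p < ncontrib) ? p * BLOCK : 0; /* displacement unused when count is 0 */
    }

    double *recvbuf = malloc(ncontrib * BLOCK * sizeof(double));
    MPI_Allgatherv(sendbuf, sendcount, MPI_DOUBLE,
                   recvbuf, recvcounts, displs, MPI_DOUBLE,
                   MPI_COMM_WORLD);

    if (rank == size - 1)  /* even a non-contributing rank holds the full result */
        printf("rank %d received %d elements\n", rank, ncontrib * BLOCK);

    free(recvcounts); free(displs); free(recvbuf);
    MPI_Finalize();
    return 0;
}

This emulation still issues a collective over all P processes even though only a subset contributes, which is one way to read the abstract's motivation for specialized variants of these primitives: a collective aware of the reduced contributor set could, in principle, avoid the associated overhead.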