Theoretical Scalability Analysis of Distributed Deep Convolutional Neural Networks
Adrián Castelló, M. F. Dolz, E. S. Quintana-Ortí, J. Duato
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May 2019
DOI: 10.1109/CCGRID.2019.00068
Citations: 12
Abstract
We analyze the asymptotic performance of the training process of deep neural networks (NNs) on clusters in order to determine their scalability. For this purpose, i) we assume a data-parallel implementation of the training algorithm, which distributes the batches among the cluster nodes and replicates the model; ii) we leverage the roofline model to inspect the performance at the node level, taking into account the floating-point unit throughput and memory bandwidth; and iii) we consider distinct collective communication schemes that are optimal depending on the message size and underlying network interconnection topology. We then apply the resulting performance model to analyze the scalability of several well-known deep convolutional neural networks as a function of the batch size, node floating-point throughput, node memory bandwidth, cluster dimension, and link bandwidth.
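The abstract combines a per-node roofline bound with a collective-communication cost for the gradient exchange. The following Python sketch illustrates that style of model under stated assumptions; the function names, the ring-allreduce cost formula, and all numeric parameters are illustrative placeholders, not the exact expressions derived in the paper.

```python
# Minimal sketch (not the paper's exact model): per-iteration time for
# data-parallel training as roofline-bounded compute plus an allreduce
# of the model gradients. All parameters below are hypothetical.

def roofline_time(flops, bytes_moved, peak_flops, mem_bw):
    """One node's compute time: bounded by either the floating-point
    unit throughput or the memory bandwidth (classic roofline)."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

def ring_allreduce_time(model_bytes, nodes, link_bw, latency=1e-6):
    """Bandwidth-optimal ring allreduce: 2*(p-1) steps, each moving
    roughly 1/p of the gradient buffer per link (assumed scheme)."""
    if nodes == 1:
        return 0.0
    return 2 * (nodes - 1) * (latency + (model_bytes / nodes) / link_bw)

def iteration_time(batch_size, nodes, flops_per_sample, bytes_per_sample,
                   model_bytes, peak_flops, mem_bw, link_bw):
    """Per-iteration time with the batch split evenly across the nodes."""
    local_samples = batch_size / nodes
    compute = roofline_time(local_samples * flops_per_sample,
                            local_samples * bytes_per_sample,
                            peak_flops, mem_bw)
    return compute + ring_allreduce_time(model_bytes, nodes, link_bw)

# Example scan over cluster sizes, reporting speedup over one node
# (all magnitudes are made-up placeholders for a CNN-like workload).
if __name__ == "__main__":
    args = dict(batch_size=256, flops_per_sample=4e9, bytes_per_sample=2e8,
                model_bytes=4 * 25e6, peak_flops=10e12, mem_bw=900e9,
                link_bw=12.5e9)
    base = iteration_time(nodes=1, **args)
    for p in (1, 2, 4, 8, 16, 32):
        t = iteration_time(nodes=p, **args)
        print(f"{p:3d} nodes: {t*1e3:7.3f} ms/iter, speedup {base / t:5.2f}x")
```

Such a model makes the abstract's point concrete: as the node count grows, the per-node compute term shrinks while the allreduce term stays roughly constant (or grows with latency), so scalability eventually saturates for a fixed batch size.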