Wenbin Jiang, Geyan Ye, L. Yang, Jian Zhu, Yang Ma, Xia Xie, Hai Jin
{"title":"A Novel Stochastic Gradient Descent Algorithm Based on Grouping over Heterogeneous Cluster Systems for Distributed Deep Learning","authors":"Wenbin Jiang, Geyan Ye, L. Yang, Jian Zhu, Yang Ma, Xia Xie, Hai Jin","doi":"10.1109/CCGRID.2019.00053","DOIUrl":null,"url":null,"abstract":"On heterogeneous cluster systems, the convergence performances of neural network models are greatly troubled by the different performances of machines. In this paper, we propose a novel distributed Stochastic Gradient Descent (SGD) algorithm named Grouping-SGD for distributed deep learning, which converges faster than Sync-SGD, Async-SGD, and Stale-SGD. In Grouping-SGD, machines are partitioned into multiple groups, ensuring that machines in the same group have similar performances. Machines in the same group update the models synchronously, while different groups update the models asynchronously. To improve the performance of Grouping-SGD further, the parameter servers are arranged from fast to slow, and they are responsible for updating the model parameters from the lower layer to the higher layer respectively. The experimental results indicate that Grouping-SGD can achieve 1.2-3.7 times speedups using popular image classification benchmarks: MNIST, Cifar10, Cifar100, and ImageNet, compared to Sync-SGD, Async-SGD, and Stale-SGD.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2019.00053","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8
Abstract
On heterogeneous cluster systems, the convergence of neural network models suffers greatly from the differing performance of the machines. In this paper, we propose a novel distributed Stochastic Gradient Descent (SGD) algorithm named Grouping-SGD for distributed deep learning, which converges faster than Sync-SGD, Async-SGD, and Stale-SGD. In Grouping-SGD, machines are partitioned into multiple groups such that machines in the same group have similar performance. Machines within a group update the model synchronously, while different groups update the model asynchronously. To further improve the performance of Grouping-SGD, the parameter servers are arranged from fastest to slowest and are assigned to update the model parameters from the lower layers to the higher layers, respectively. Experimental results on popular image classification benchmarks (MNIST, Cifar10, Cifar100, and ImageNet) indicate that Grouping-SGD achieves speedups of 1.2-3.7x compared to Sync-SGD, Async-SGD, and Stale-SGD.
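The grouping step described above can be illustrated with a short sketch. The Python snippet below is a minimal illustration, not the authors' implementation: the worker names, throughput values, group count, and the helper function partition_into_groups are all hypothetical, and only the partitioning idea (group machines of similar speed; synchronize within a group, run asynchronously across groups) follows the abstract.

```python
# Minimal sketch of the grouping idea from the abstract (hypothetical names/values).
from typing import Dict, List

def partition_into_groups(worker_speeds: Dict[str, float], num_groups: int) -> List[List[str]]:
    """Sort workers by measured throughput and split them into contiguous
    groups so that each group holds machines of similar performance."""
    ranked = sorted(worker_speeds, key=worker_speeds.get, reverse=True)
    size = -(-len(ranked) // num_groups)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

# Example: four machines with different measured throughputs (images/sec).
speeds = {"w0": 950.0, "w1": 930.0, "w2": 410.0, "w3": 390.0}
groups = partition_into_groups(speeds, num_groups=2)
print(groups)  # -> [['w0', 'w1'], ['w2', 'w3']]

# Per the abstract, gradients within ['w0', 'w1'] would be aggregated
# synchronously, while the two groups would push their updates to the
# parameter servers asynchronously with respect to each other.
```

Grouping by measured speed keeps the synchronization barrier cheap (similar machines finish an iteration at roughly the same time), while asynchrony across groups prevents slow groups from stalling fast ones.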