{"title":"HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning","authors":"Yijun Li, Jiawei Huang, Zhaoyi Li, Shengwen Zhou, Wanchun Jiang, Jianxin Wang","doi":"10.1145/3545008.3545024","DOIUrl":null,"url":null,"abstract":"In the parameter-server-based distributed deep learning system, the workers simultaneously communicate with the parameter server to refine model parameters, easily resulting in severe network contention. To solve this problem, Asynchronous Parallel (ASP) strategy enables each worker to update the parameter independently without synchronization. However, due to the inconsistency of parameters among workers, ASP experiences accuracy loss and slow convergence. In this paper, we propose Hybrid Synchronous Parallelism (HSP), which mitigates the communication contention without excessive degradation of convergence speed. Specifically, the parameter server sequentially pulls gradients from workers to eliminate network congestion and synchronizes all up-to-date parameters after each iteration. Meanwhile, HSP cautiously lets idle workers to compute with out-of-date weights to maximize the utilizations of computing resources. We provide theoretical analysis of convergence efficiency and implement HSP on popular deep learning (DL) framework. The test results show that HSP improves the convergence speedup of three classical deep learning models by up to 67%.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
In parameter-server-based distributed deep learning systems, the workers communicate with the parameter server simultaneously to refine the model parameters, which easily results in severe network contention. To solve this problem, the Asynchronous Parallel (ASP) strategy enables each worker to update the parameters independently without synchronization. However, due to the inconsistency of parameters among workers, ASP suffers from accuracy loss and slow convergence. In this paper, we propose Hybrid Synchronous Parallelism (HSP), which mitigates communication contention without excessively degrading convergence speed. Specifically, the parameter server sequentially pulls gradients from the workers to eliminate network congestion and synchronizes all up-to-date parameters after each iteration. Meanwhile, HSP cautiously lets idle workers compute with out-of-date weights to maximize the utilization of computing resources. We provide a theoretical analysis of convergence efficiency and implement HSP on a popular deep learning (DL) framework. The test results show that HSP improves the convergence speed of three classical deep learning models by up to 67%.
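To make the described mechanism concrete, below is a minimal, single-process sketch of the sequential-pull idea as it is summarized in the abstract: the server pulls one gradient at a time to avoid concurrent transfers, applies each gradient immediately, and broadcasts the up-to-date parameters to all workers at the end of the iteration, while workers may keep computing with stale weights in the meantime. All names (ParameterServer, Worker, compute_gradient) and the update rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of HSP's sequential gradient pulling, based only on the
# abstract. Not the authors' code; names, data, and the SGD step are assumed.
import numpy as np


class Worker:
    def __init__(self, wid, dim):
        self.wid = wid
        self.weights = np.zeros(dim)  # local copy of parameters, possibly stale

    def compute_gradient(self):
        # Placeholder for a real forward/backward pass. The worker uses
        # whatever weights it currently holds, which may be out of date.
        return np.random.randn(*self.weights.shape)


class ParameterServer:
    def __init__(self, num_workers, dim, lr=0.01):
        self.params = np.zeros(dim)
        self.lr = lr
        self.workers = [Worker(i, dim) for i in range(num_workers)]

    def run_iteration(self):
        # Pull gradients one worker at a time, so only a single
        # worker-to-server transfer is in flight (no network contention).
        for w in self.workers:
            grad = w.compute_gradient()
            self.params -= self.lr * grad  # apply the pulled gradient immediately
        # After all pulls, synchronize the up-to-date parameters to every worker.
        for w in self.workers:
            w.weights = self.params.copy()


if __name__ == "__main__":
    ps = ParameterServer(num_workers=4, dim=10)
    for _ in range(3):
        ps.run_iteration()
    print("parameters after 3 iterations:", ps.params)
```

In a real distributed setting the pulls would be network transfers scheduled by the server rather than an in-process loop, and the staleness of each worker's weights would be bounded by the synchronization at the end of every iteration; this sketch only illustrates the control flow.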