HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning

Proceedings of the 2018 Workshop on Network Meets AI & ML. Pub Date: 2018-08-07. DOI: 10.1145/3229543.3229544

Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, Junfeng Li

Citations: 25

Abstract

In large-scale distributed machine learning (DML) systems, parameter (gradient) synchronization among machines plays an important role in DML performance. State-of-the-art DML synchronization algorithms, whether the parameter server (PS) based algorithm or the ring allreduce algorithm, work in a flat way and perform poorly when the network is large. In this work, we propose HiPS, a hierarchical parameter (gradient) synchronization framework for large-scale DML. HiPS uses a server-centric network topology to better support RDMA/RoCE transport between machines, and synchronizes the parameters (gradients) in a hierarchical and hybrid way. Our evaluation on BCube and Torus networks demonstrates that HiPS is a better match for server-centric networks. Compared with the flat algorithms (PS-based and ring-based), HiPS reduces the synchronization time by 73% and 75%, respectively.
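To make the contrast between flat and hierarchical synchronization concrete, the following is a minimal sketch of a generic two-level allreduce simulated in plain Python. It is not the HiPS algorithm itself, which the abstract does not specify in detail. The function names, the `group_size` parameter, and the round-count model (a textbook ring allreduce costing 2(n - 1) rounds) are assumptions introduced here purely for illustration.

```python
from typing import List

Vector = List[float]


def flat_allreduce(grads: List[Vector]) -> List[Vector]:
    """Flat synchronization: sum all gradients in one stage, give every worker the result."""
    total = [sum(column) for column in zip(*grads)]
    return [list(total) for _ in grads]


def hierarchical_allreduce(grads: List[Vector], group_size: int) -> List[Vector]:
    """Hypothetical two-level synchronization (illustrative only, not the paper's method)."""
    if group_size <= 0 or len(grads) % group_size != 0:
        raise ValueError("number of workers must be a positive multiple of group_size")
    # Stage 1: workers in the same group reduce their gradients locally.
    groups = [grads[i:i + group_size] for i in range(0, len(grads), group_size)]
    partial_sums = [[sum(column) for column in zip(*group)] for group in groups]
    # Stage 2: one partial sum per group is reduced across groups.
    total = [sum(column) for column in zip(*partial_sums)]
    # Stage 3: the global sum is distributed back to every worker.
    return [list(total) for _ in grads]


def flat_ring_rounds(n: int) -> int:
    """Communication rounds of a textbook ring allreduce over n workers."""
    return 2 * (n - 1)


def two_level_ring_rounds(n: int, group_size: int) -> int:
    """Rounds when rings run inside groups and then across groups (assumed model)."""
    num_groups = n // group_size
    return 2 * (group_size - 1) + 2 * (num_groups - 1)


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(hierarchical_allreduce(grads, group_size=2) == flat_allreduce(grads))
    print(flat_ring_rounds(16), two_level_ring_rounds(16, group_size=4))
```

Both functions produce the same synchronized gradients; the difference lies only in how communication is staged. Under this simplified model the number of rounds grows linearly with the worker count for a flat ring, while the two-level scheme depends on the group size and the number of groups. The round count ignores bandwidth, topology, and transport effects, so it says nothing about the 73% and 75% reductions reported in the abstract.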


