Distributed Stochastic Gradient Descent With Staleness: A Stochastic Delay Differential Equation Based Framework

IF 4.6 · CAS Tier 2 (Engineering & Technology) · JCR Q1 (Engineering, Electrical & Electronic)
Siyuan Yu; Wei Chen; H. Vincent Poor
{"title":"Distributed Stochastic Gradient Descent With Staleness: A Stochastic Delay Differential Equation Based Framework","authors":"Siyuan Yu;Wei Chen;H. Vincent Poor","doi":"10.1109/TSP.2025.3546574","DOIUrl":null,"url":null,"abstract":"Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, stragglers and limited bandwidth may induce random computational/ communication delays, thereby severely hindering the learning process. Therefore, how to accelerate asynchronous SGD (ASGD) by efficiently scheduling multiple workers is an important issue. In this paper, a unified framework is presented to analyze and optimize the convergence of ASGD based on stochastic delay differential equations (SDDEs) and the Poisson approximation of aggregated gradient arrivals. In particular, we present the run time and staleness of distributed SGD without a memorylessness assumption on the computation times. Given the learning rate, we reveal the relevant SDDE's damping coefficient and its delay statistics, as functions of the number of activated clients, staleness threshold, the eigenvalues of the Hessian matrix of the objective function, and the overall computational/communication delay. The formulated SDDE allows us to present both the distributed SGD's convergence condition and speed by calculating its characteristic roots, thereby optimizing the scheduling policies for asynchronous/event-triggered SGD. It is interestingly shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness. Moreover, a small degree of staleness does not necessarily slow down the convergence, while a large degree of staleness will result in the divergence of distributed SGD. Numerical results demonstrate the potential of our SDDE framework, even in complex learning tasks with non-convex objective functions.","PeriodicalId":13330,"journal":{"name":"IEEE Transactions on Signal Processing","volume":"73 ","pages":"1708-1726"},"PeriodicalIF":4.6000,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10909566/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, stragglers and limited bandwidth may induce random computational/communication delays, thereby severely hindering the learning process. Therefore, how to accelerate asynchronous SGD (ASGD) by efficiently scheduling multiple workers is an important issue. In this paper, a unified framework is presented to analyze and optimize the convergence of ASGD based on stochastic delay differential equations (SDDEs) and the Poisson approximation of aggregated gradient arrivals. In particular, we characterize the run time and staleness of distributed SGD without a memorylessness assumption on the computation times. Given the learning rate, we reveal the relevant SDDE's damping coefficient and its delay statistics as functions of the number of activated clients, the staleness threshold, the eigenvalues of the Hessian matrix of the objective function, and the overall computational/communication delay. The formulated SDDE allows us to present both the convergence condition and the convergence speed of distributed SGD by calculating its characteristic roots, thereby optimizing the scheduling policies for asynchronous/event-triggered SGD. Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness. Moreover, a small degree of staleness does not necessarily slow down the convergence, while a large degree of staleness will result in the divergence of distributed SGD. Numerical results demonstrate the potential of our SDDE framework, even in complex learning tasks with non-convex objective functions.
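To make the staleness effect described above concrete, the sketch below simulates the scalar delayed gradient flow dθ/dt = -η·λ·θ(t - τ), the deterministic skeleton of the kind of linear SDDE the abstract refers to (the stochastic term and the Poisson-approximated gradient arrivals are omitted for brevity). This is an illustrative sketch, not the paper's code or notation: eta (learning rate), lam (a Hessian eigenvalue), and tau (staleness) are assumed stand-ins, and the classical stability condition η·λ·τ < π/2 for this scalar delay equation is used only to pick a "small" and a "large" staleness value.

```python
# Minimal illustrative sketch (not the paper's code): Euler simulation of the
# delayed gradient flow d(theta)/dt = -eta * lam * theta(t - tau), i.e. the
# deterministic core of a linear SDDE with constant staleness tau.
# For this scalar equation the classical stability condition is
# eta * lam * tau < pi / 2: mild staleness still converges, large staleness diverges.
import numpy as np

def delayed_gradient_flow(eta, lam, tau, T=50.0, dt=0.01, theta0=1.0):
    """Simulate theta'(t) = -eta*lam*theta(t - tau) with constant history theta0."""
    n_steps = int(T / dt)
    delay = int(tau / dt)                  # staleness measured in Euler steps
    theta = np.empty(n_steps + 1)
    theta[:delay + 1] = theta0             # constant history on [-tau, 0]
    for k in range(delay, n_steps):
        theta[k + 1] = theta[k] - dt * eta * lam * theta[k - delay]
    return theta

eta, lam = 0.5, 2.0                        # illustrative learning rate and Hessian eigenvalue
for tau in (0.5, 2.0):                     # eta*lam*tau = 0.5 (< pi/2) vs 2.0 (> pi/2)
    traj = delayed_gradient_flow(eta, lam, tau)
    print(f"staleness tau={tau}: |theta(T)| = {abs(traj[-1]):.2e}")
```

Running this prints a near-zero terminal value for the small staleness and a large, growing one for the large staleness, mirroring the abstract's claim that a small degree of staleness need not slow convergence while a large degree of staleness causes divergence.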
Source journal
IEEE Transactions on Signal Processing (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 11.20
Self-citation rate: 9.30%
Annual articles: 310
Review time: 3.0 months
About the journal: The IEEE Transactions on Signal Processing covers novel theory, algorithms, performance analyses and applications of techniques for the processing, understanding, learning, retrieval, mining, and extraction of information from signals. The term "signal" includes, among others, audio, video, speech, image, communication, geophysical, sonar, radar, medical and musical signals. Examples of topics of interest include, but are not limited to, information processing and the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals.