Title: The Effectiveness of Local Updates for Decentralized Learning Under Data Heterogeneity
Authors: Tongle Wu; Zhize Li; Ying Sun
Journal: IEEE Transactions on Signal Processing, vol. 73, pp. 751-765
DOI: 10.1109/TSP.2025.3533208
Publication date: 2025-01-24
URL: https://ieeexplore.ieee.org/document/10852183/
Citations: 0
Abstract
We revisit two fundamental decentralized optimization methods, Decentralized Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD), with multiple local updates. We consider two settings and demonstrate that incorporating local update steps can reduce communication complexity. Specifically, for $\mu$-strongly convex and $L$-smooth loss functions, we prove that local DGT achieves communication complexity $\tilde{\mathcal{O}}\Big(\frac{L}{\mu(K+1)}+\frac{\delta+\mu}{\mu(1-\rho)}+\frac{\rho}{(1-\rho)^{2}}\cdot\frac{L+\delta}{\mu}\Big)$, where $K$ is the number of additional local updates, $\rho$ measures the network connectivity, and $\delta$ measures the second-order heterogeneity of the local losses. Our results reveal the tradeoff between communication and computation, showing that increasing $K$ can effectively reduce communication costs when the data heterogeneity is low and the network is well-connected. We then consider the over-parameterization regime, where the local losses share the same minimizers. We prove that employing local updates in DGD, even without gradient correction, achieves exact linear convergence under the Polyak-Łojasiewicz (PL) condition, yielding an effect similar to DGT in reducing communication complexity. We further customize the result to linear models, obtaining an improved rate expression. Numerical experiments validate our theoretical results.
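To make the over-parameterization setting concrete, the following minimal sketch runs DGD with $K$ extra local gradient steps per communication round on a toy problem where heterogeneous local quadratics share a common minimizer (so the PL condition holds and exact convergence is possible). All problem sizes, step sizes, and the ring mixing matrix are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 4, 3, 5              # agents, dimension, additional local steps
x_star = rng.normal(size=d)    # common minimizer shared by all local losses

# Heterogeneous local quadratics f_i(x) = 0.5 * ||A_i (x - x_star)||^2;
# the small random perturbation keeps every A_i well-conditioned.
A = [0.3 * rng.normal(size=(d, d)) + 2.0 * np.eye(d) for _ in range(n)]

def grad(i, x):
    """Gradient of the i-th local loss at x."""
    return A[i].T @ (A[i] @ (x - x_star))

# Doubly stochastic mixing matrix for a ring network (illustrative choice).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eta = 0.05                     # local step size (assumed, not tuned)
X = np.zeros((n, d))           # one row per agent
for _ in range(300):           # communication rounds
    for _ in range(K + 1):     # K + 1 local gradient steps, no gradient correction
        for i in range(n):
            X[i] -= eta * grad(i, X[i])
    X = W @ X                  # one round of gossip averaging

# Distance of each agent's iterate to the shared minimizer
err = max(np.linalg.norm(X[i] - x_star) for i in range(n))
```

Because the local losses share a minimizer, all agents are driven to `x_star` exactly despite the heterogeneity of the `A_i`, matching the qualitative behavior the abstract describes; with distinct local minimizers, plain DGD would instead stall at a bias proportional to the step size.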
Journal Introduction:
The IEEE Transactions on Signal Processing covers novel theory, algorithms, performance analyses and applications of techniques for the processing, understanding, learning, retrieval, mining, and extraction of information from signals. The term “signal” includes, among others, audio, video, speech, image, communication, geophysical, sonar, radar, medical and musical signals. Examples of topics of interest include, but are not limited to, information processing and the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals.