Adaptive Batch Size Time Evolving Stochastic Gradient Descent for Federated Learning.

IF 18.6, Zone 1 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2025-09-15. DOI: 10.1109/tpami.2025.3610169

Xuming An, Li Shen, Yong Luo, Han Hu, Dacheng Tao


Citations: 0

Abstract

Variance reduction has been shown to improve the performance of Stochastic Gradient Descent (SGD) in centralized machine learning. However, when it is extended to federated learning systems, several issues arise, including (i) mega-batch size settings; (ii) additional noise introduced by the gradient difference between the current iteration and the snapshot point; and (iii) gradient (statistical) heterogeneity. In this paper, we propose a lightweight algorithm termed federated adaptive batch size time evolving variance reduction (FedATEVR) to tackle these issues. It consists of an adaptive batch size setting scheme and a time-evolving variance reduction gradient estimator. In particular, we use historical gradient information to set an appropriate mega-batch size for each client, which steadily accelerates the local SGD process and reduces the computation cost. The historical information involves both global and local gradients, which mitigates the unstable variation in mega-batch size introduced by gradient heterogeneity among the clients. For each client, the gradient difference between the current iteration and the snapshot point is used to tune the time-evolving weight of the variance reduction term in the gradient estimator. This avoids meaningless variance reduction caused by an out-of-date snapshot point gradient. We theoretically prove that our algorithm achieves a linear speedup, with a rate of $\mathcal{O}(\frac{1}{\sqrt{SKT}})$, for non-convex objective functions under partial client participation. Extensive experiments demonstrate that our proposed method achieves higher test accuracy than the baselines and greatly decreases the number of communication rounds.
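The abstract describes two mechanisms without giving their formulas: a variance reduction term whose weight shrinks as the snapshot gradient goes stale, and a mega-batch size chosen from local and global gradient history. The following Python sketch illustrates the general idea only. It is not the FedATEVR update rule. The weighting rule `1 / (1 + scale * diff)`, the batch size rule, and every function name, parameter, and default value are assumptions introduced here for illustration.

```python
import numpy as np


def weighted_vr_gradient(g_cur, g_snap_batch, g_snap_full, weight):
    """SVRG-style estimator with a tunable weight on the correction term.

    g_cur        : mini-batch gradient at the current iterate
    g_snap_batch : gradient of the same mini-batch at the snapshot point
    g_snap_full  : mega-batch gradient at the snapshot point
    weight       : 1.0 gives the standard SVRG estimator, 0.0 gives plain SGD
    """
    return g_cur - weight * (g_snap_batch - g_snap_full)


def staleness_weight(g_cur, g_snap_batch, scale=1.0):
    """Hypothetical weighting rule (an assumption, not taken from the paper).

    The weight equals 1 when the current and snapshot gradients agree and
    decays toward 0 as their difference grows, so an out-of-date snapshot
    contributes less to the estimator.
    """
    diff = np.linalg.norm(g_cur - g_snap_batch)
    return 1.0 / (1.0 + scale * diff)


def adaptive_mega_batch(local_norm, global_norm, base=64, lo=16, hi=1024, mix=0.5):
    """Hypothetical batch size rule (an assumption, not taken from the paper).

    Blends local and global historical gradient norms so that a single
    client's heterogeneous gradients do not swing the batch size wildly.
    A small blended norm suggests noise dominates, so the batch grows.
    """
    blended = mix * local_norm + (1.0 - mix) * global_norm
    size = int(round(base / max(blended, 1e-8)))
    return int(np.clip(size, lo, hi))


if __name__ == "__main__":
    g_cur = np.array([1.0, 2.0])
    g_snap_batch = np.array([0.5, 1.0])
    g_snap_full = np.array([0.2, 0.4])

    w = staleness_weight(g_cur, g_snap_batch)
    print("weight:", w)
    print("estimator:", weighted_vr_gradient(g_cur, g_snap_batch, g_snap_full, w))
    print("mega-batch size:", adaptive_mega_batch(local_norm=0.3, global_norm=0.1))
```

The design point the sketch tries to capture is that both quantities are driven by gradients a client already has in hand (current, snapshot, and historical), which is consistent with the abstract's description of the method as lightweight.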


Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence (Engineering and Technology, Engineering: Electrical and Electronic)

CiteScore

28.40

Self-citation rate

3.00%

Publications

885

Review time

8.5 months

Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.