{"title":"R-FAST:通用拓扑上的鲁棒全同步随机梯度跟踪","authors":"Zehan Zhu;Ye Tian;Yan Huang;Jinming Xu;Shibo He","doi":"10.1109/TSIPN.2024.3444484","DOIUrl":null,"url":null,"abstract":"We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST) for distributed machine learning problems over a network of nodes, where each node performs local computation and communication at its own pace without any form of synchronization. Different from existing asynchronous distributed algorithms, R-FAST can eliminate the impact of data heterogeneity across nodes on convergence performance and allow for packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. Moreover, the proposed method utilizes two spanning-tree graphs for communication so long as both share at least one common root, enabling flexible designs in communication topologies. We show that R-FAST converges in expectation to a neighborhood of the optimum with a geometric rate for smooth and strongly convex objectives; and to a stationary point with a sublinear rate for general non-convex problems. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while still achieving comparable accuracy, and outperforms the existing well-known asynchronous algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.","PeriodicalId":56268,"journal":{"name":"IEEE Transactions on Signal and Information Processing over Networks","volume":"10 ","pages":"665-678"},"PeriodicalIF":3.0000,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"R-FAST: Robust Fully-Asynchronous Stochastic Gradient Tracking Over General Topology\",\"authors\":\"Zehan Zhu;Ye Tian;Yan Huang;Jinming Xu;Shibo He\",\"doi\":\"10.1109/TSIPN.2024.3444484\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST) for distributed machine learning problems over a network of nodes, where each node performs local computation and communication at its own pace without any form of synchronization. Different from existing asynchronous distributed algorithms, R-FAST can eliminate the impact of data heterogeneity across nodes on convergence performance and allow for packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. Moreover, the proposed method utilizes two spanning-tree graphs for communication so long as both share at least one common root, enabling flexible designs in communication topologies. We show that R-FAST converges in expectation to a neighborhood of the optimum with a geometric rate for smooth and strongly convex objectives; and to a stationary point with a sublinear rate for general non-convex problems. 
Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while still achieving comparable accuracy, and outperforms the existing well-known asynchronous algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.\",\"PeriodicalId\":56268,\"journal\":{\"name\":\"IEEE Transactions on Signal and Information Processing over Networks\",\"volume\":\"10 \",\"pages\":\"665-678\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2024-08-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Signal and Information Processing over Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10660468/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Signal and Information Processing over Networks","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10660468/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
R-FAST: Robust Fully-Asynchronous Stochastic Gradient Tracking Over General Topology
We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST) for distributed machine learning problems over a network of nodes, where each node performs local computation and communication at its own pace without any form of synchronization. Unlike existing asynchronous distributed algorithms, R-FAST eliminates the impact of data heterogeneity across nodes on convergence performance and tolerates packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. Moreover, the proposed method communicates over two spanning-tree graphs, requiring only that they share at least one common root, which enables flexible designs of the communication topology. We show that R-FAST converges in expectation to a neighborhood of the optimum at a geometric rate for smooth and strongly convex objectives, and to a stationary point at a sublinear rate for general non-convex problems. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms such as Ring-AllReduce and D-PSGD while achieving comparable accuracy, and outperforms well-known existing asynchronous algorithms such as AD-PSGD and OSGP, especially in the presence of stragglers.
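To make the gradient-tracking mechanism behind the abstract concrete, the following is a minimal synchronous sketch of plain gradient tracking, the building block that R-FAST robustifies. This is not the authors' algorithm: the paper's asynchrony, buffering, packet-loss handling, and two-spanning-tree communication are all omitted, and the node count, ring mixing matrix, step size, and quadratic local objectives are illustrative assumptions.

```python
# Minimal synchronous gradient-tracking sketch (NOT R-FAST itself):
# it only illustrates how tracking variables remove the bias caused
# by data heterogeneity across nodes. All problem data are made up.
import numpy as np

n, d = 4, 3  # number of nodes and parameter dimension (hypothetical)
rng = np.random.default_rng(0)

# Heterogeneous local objectives f_i(x) = 0.5 * ||A_i x - b_i||^2
A = [rng.standard_normal((5, d)) for _ in range(n)]
b = [rng.standard_normal(5) for _ in range(n)]
grad = lambda i, x: A[i].T @ (A[i] @ x - b[i])

# Doubly stochastic mixing matrix over a ring (an assumption here;
# R-FAST instead uses row-/column-stochastic weights over two
# spanning trees sharing a common root).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = 0.25
    W[i, (i - 1) % n] = 0.25

alpha = 0.02
x = np.zeros((n, d))                                  # local models
y = np.stack([grad(i, x[i]) for i in range(n)])       # trackers: y_i^0 = grad f_i(x_i^0)

for _ in range(500):
    x_new = W @ x - alpha * y  # consensus step plus descent along tracked gradient
    # Tracker update: y accumulates gradient differences, so the average
    # of the y_i always equals the average of the current local gradients.
    g_old = np.stack([grad(i, x[i]) for i in range(n)])
    g_new = np.stack([grad(i, x_new[i]) for i in range(n)])
    y = W @ y + g_new - g_old
    x = x_new
```

In this toy version all nodes update in lockstep; the paper's contribution is showing how to preserve the tracking invariant when updates arrive asynchronously and packets may be lost.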
Journal Introduction:
The IEEE Transactions on Signal and Information Processing over Networks publishes high-quality papers that extend the classical notions of processing signals defined over vector spaces (e.g., time and space) to processing signals and information (data) defined over networks, which may vary dynamically. In signal processing over networks, the topology of the network may define structural relationships in the data or may constrain processing of the data. Topics include distributed algorithms for filtering, detection, estimation, adaptation and learning, model selection, data fusion, and diffusion or evolution of information over such networks, as well as applications of distributed signal processing.