On the Trade-Off Between Flatness and Optimization in Distributed Learning

Impact Factor: 18.6
Ying Cao, Zhaoxian Wu, Kun Yuan, Ali H. Sayed
{"title":"On the Trade-Off Between Flatness and Optimization in Distributed Learning","authors":"Ying Cao;Zhaoxian Wu;Kun Yuan;Ali H. Sayed","doi":"10.1109/TPAMI.2025.3583104","DOIUrl":null,"url":null,"abstract":"This paper proposes a theoretical framework to evaluate and compare the performance of stochastic gradient algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have noticed that convergence toward flat local minima tend to enhance the generalization ability of learning algorithms. This work discovers three interesting results. First, it shows that decentralized learning strategies are able to escape faster away from local minima and favor convergence toward flatter minima relative to the centralized solution. Second, in decentralized methods, the consensus strategy has a worse excess-risk performance than diffusion, giving it a better chance of escaping from local minima and favoring flatter minima. Third, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimum but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. In this regard, since diffusion has a lower excess-risk than consensus, when both algorithms are trained starting from random initial points, diffusion enhances the classification accuracy. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies deliver in general enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance compared to the centralized solution.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8873-8888"},"PeriodicalIF":18.6000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11050993/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This paper proposes a theoretical framework for evaluating and comparing the performance of stochastic gradient algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have observed that convergence toward flat local minima tends to enhance the generalization ability of learning algorithms. This work establishes three interesting results. First, it shows that decentralized learning strategies escape from local minima faster and favor convergence toward flatter minima relative to the centralized solution. Second, among decentralized methods, the consensus strategy has worse excess-risk performance than diffusion, which gives it a better chance of escaping from local minima and favoring flatter ones. Third, and importantly, the ultimate classification accuracy depends not only on the flatness of the local minimum but also on how well a learning algorithm can approach that minimum; in other words, classification accuracy is a function of both flatness and optimization performance. In this regard, since diffusion attains a lower excess risk than consensus, it delivers higher classification accuracy when both algorithms are trained from random initial points. The paper closely examines the interplay between the two measures of flatness and optimization error. One important conclusion is that decentralized strategies generally deliver enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance than the centralized solution.
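The consensus and diffusion strategies compared in the abstract differ in where the gradient step and the neighborhood combination occur within each iteration. Below is a minimal sketch of the two update rules over a small network; the quadratic local losses, step size, and uniform combination matrix are illustrative assumptions for this sketch, not the paper's experimental setup.

```python
import numpy as np

K, d = 5, 3          # number of agents, parameter dimension
mu = 0.05            # step size (illustrative)
rng = np.random.default_rng(0)

# Doubly stochastic combination matrix A (here: uniform averaging over all agents)
A = np.full((K, K), 1.0 / K)

# Hypothetical local losses Q_k(w) = 0.5 * ||w - t_k||^2, one target t_k per agent
targets = rng.normal(size=(K, d))

def grad(k, w):
    """Gradient of agent k's local quadratic loss at w."""
    return w - targets[k]

def consensus_step(W):
    """Consensus: combine neighbors' previous iterates, while the gradient
    is evaluated at the agent's own previous iterate."""
    combined = A @ W                                     # neighborhood averaging
    grads = np.stack([grad(k, W[k]) for k in range(K)])
    return combined - mu * grads

def diffusion_step(W):
    """Diffusion (adapt-then-combine): each agent first takes a local
    gradient step, then combines the intermediate iterates of its neighbors."""
    psi = np.stack([W[k] - mu * grad(k, W[k]) for k in range(K)])
    return A @ psi                                       # combine after adapting

W = rng.normal(size=(K, d))
for _ in range(200):
    W = diffusion_step(W)    # swap in consensus_step(W) to compare trajectories
print("spread across agents:", np.linalg.norm(W - W.mean(axis=0)))
```

The structural difference is small but consequential: consensus applies the combination and the gradient step in parallel, whereas diffusion combines iterates that have already been updated locally. This is the asymmetry the abstract connects to the excess-risk gap between the two strategies.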