Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization

J. Mach. Learn. Res., Pub Date: 2022-08-22, DOI: 10.48550/arXiv.2208.10025

Zhize Li, Jian Li

Publication type: Journal Article. Pages: 239:1-239:61.

Citations: 5

Abstract

We propose and analyze several stochastic gradient algorithms for finding stationary points or local minima in nonconvex finite-sum and online optimization problems, possibly with a nonsmooth regularizer. First, we propose a simple proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. We provide a clean and tight analysis of ProxSVRG+, which shows that it outperforms deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes and thereby solves an open problem posed in Reddi et al. (2016b). ProxSVRG+ also uses far fewer proximal oracle calls than ProxSVRG (Reddi et al., 2016b) and extends to the online setting by avoiding full gradient computations. Then, we propose an optimal algorithm, called SSRGD, based on SARAH (Nguyen et al., 2017), and show that SSRGD further improves on the gradient complexity of ProxSVRG+ and achieves the optimal upper bound, matching the known lower bounds (Fang et al., 2018; Li et al., 2021). Moreover, we show that both ProxSVRG+ and SSRGD adapt automatically to local structure of the objective function, such as the Polyak-Łojasiewicz (PL) condition for nonconvex functions in the finite-sum case; that is, we prove that both algorithms switch automatically to faster global linear convergence without the restarts used in prior work on ProxSVRG (Reddi et al., 2016b). Finally, we focus on the more challenging problem of finding an $(\epsilon, \delta)$-local minimum rather than just an $\epsilon$-approximate (first-order) stationary point, which may be a bad, unstable saddle point. We show that SSRGD can find an $(\epsilon, \delta)$-local minimum by simply adding random perturbations. Our algorithm is almost as simple as its counterpart for finding stationary points, and achieves similar optimal rates.
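To make the idea of a proximal, variance-reduced stochastic gradient step concrete, here is a minimal sketch in plain Python. It is not the authors' ProxSVRG+ or SSRGD: the toy least-squares components, the L1 regularizer, the function names, and every parameter value are assumptions chosen only for illustration. The loop follows the generic SVRG pattern: estimate a snapshot gradient on a (possibly subsampled) batch, correct each minibatch gradient with it, then apply the proximal operator of the regularizer.

```python
import random

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1, applied coordinate-wise."""
    return [max(abs(xi) - tau, 0.0) * (1.0 if xi > 0 else -1.0) for xi in x]

def grad_i(x, a, y):
    """Gradient of the toy component f_i(x) = 0.5 * (a . x - y)^2 (an assumption)."""
    r = sum(aj * xj for aj, xj in zip(a, x)) - y
    return [r * aj for aj in a]

def mean_grad(x, data, idx):
    """Average component gradient over the sampled indices idx."""
    g = [0.0] * len(x)
    for i in idx:
        a, y = data[i]
        for j, gj in enumerate(grad_i(x, a, y)):
            g[j] += gj / len(idx)
    return g

def objective(x, data, lam):
    """(1/n) * sum_i f_i(x) + lam * ||x||_1."""
    loss = sum(0.5 * (sum(aj * xj for aj, xj in zip(a, x)) - y) ** 2
               for a, y in data) / len(data)
    return loss + lam * sum(abs(xj) for xj in x)

def prox_vr_sketch(data, x0, eta, lam, epochs, inner_steps,
                   big_batch, mini_batch, seed=0):
    """Hypothetical SVRG-style proximal loop; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    n = len(data)
    x = list(x0)
    for _ in range(epochs):
        snapshot = list(x)
        # Snapshot gradient on big_batch samples (the full gradient if big_batch == n).
        g_snap = mean_grad(snapshot, data, rng.sample(range(n), big_batch))
        for _ in range(inner_steps):
            idx = rng.sample(range(n), mini_batch)
            g_cur = mean_grad(x, data, idx)
            g_old = mean_grad(snapshot, data, idx)
            # Variance-reduced estimator: minibatch difference plus snapshot gradient.
            v = [gc - go + gs for gc, go, gs in zip(g_cur, g_old, g_snap)]
            # Gradient step followed by the proximal step for the L1 term.
            x = soft_threshold([xj - eta * vj for xj, vj in zip(x, v)], eta * lam)
    return x
```

Setting `big_batch` below the number of samples avoids a full gradient pass at each snapshot, which is the kind of change the abstract credits with extending the method to the online setting. A SARAH-style method would instead update the estimator recursively from the previous iterate rather than from a fixed snapshot; that variant is not shown here.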

