Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares

Anant Raj, Melih Barsbey, M. Gürbüzbalaban, Lingjiong Zhu, Umut Simsekli
{"title":"Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares","authors":"Anant Raj, Melih Barsbey, M. Gürbüzbalaban, Lingjiong Zhu, Umut Simsekli","doi":"10.48550/arXiv.2206.01274","DOIUrl":null,"url":null,"abstract":"Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails have links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a \\emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Algorithmic Learning Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2206.01274","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if stability is measured with the squared loss $x\mapsto x^2$, whereas it becomes stable if stability is instead measured with a surrogate loss $x\mapsto |x|^p$ for some $p<2$. (ii) Depending on the variance of the data, there exists a 'threshold of heavy-tailedness' such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
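To make the setup concrete, here is a minimal Python sketch (not the authors' code) of the model described above: SGD on a one-dimensional least-squares problem whose iterates are perturbed by alpha-stable noise, i.e. an Euler discretization of a heavy-tailed SDE, run on two datasets that differ in a single sample and compared under the squared loss $x\mapsto x^2$ and the surrogate loss $x\mapsto |x|^p$ with $p<2$. The helper `run_sgd`, the stability index `alpha = 1.7`, the step size `eta`, and the choice `p = 1.5` are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: heavy-tailed SGD on least squares and a stability-style
# comparison of two runs on neighbouring datasets (one sample replaced).
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)

# Synthetic 1-D least-squares data: y = theta* x + small Gaussian noise.
n, theta_star = 200, 2.0
X = rng.normal(size=n)
y = theta_star * X + 0.1 * rng.normal(size=n)

# Neighbouring dataset S': identical except for the first sample.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal()
y2[0] = theta_star * X2[0] + 0.1 * rng.normal()


def run_sgd(X, y, alpha=1.7, eta=0.01, n_iter=5000, seed=1):
    """SGD on the loss 0.5*(theta*x - y)^2 with additive alpha-stable noise,
    i.e. an Euler step of a heavy-tailed SDE proxy. The eta**(1/alpha)
    factor is the natural step scaling for alpha-stable increments."""
    idx_rng = np.random.default_rng(seed)
    noise = levy_stable.rvs(alpha, 0.0, size=n_iter, random_state=seed)
    theta = 0.0
    for t in range(n_iter):
        i = idx_rng.integers(len(X))           # random sample for this step
        grad = (theta * X[i] - y[i]) * X[i]    # stochastic gradient
        theta -= eta * grad + eta ** (1.0 / alpha) * noise[t]
    return theta


theta_a = run_sgd(X, y)     # trained on S
theta_b = run_sgd(X2, y2)   # trained on S' with the same seed (coupled runs)

# Evaluate both models on fresh test points under the two stability metrics.
x_test = rng.normal(size=2000)
res_a = theta_a * x_test - theta_star * x_test
res_b = theta_b * x_test - theta_star * x_test
p = 1.5
gap_sq = abs(np.mean(res_a ** 2) - np.mean(res_b ** 2))
gap_p = abs(np.mean(np.abs(res_a) ** p) - np.mean(np.abs(res_b) ** p))
print("squared-loss gap:", gap_sq)
print(f"|x|^{p} loss gap:", gap_p)
```

Because both runs share the same seed, they see the same sample indices and the same noise sequence, so the printed gaps isolate the effect of replacing a single data point, which is the quantity that uniform stability controls in the abstract.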