On the regularization effect of stochastic gradient descent applied to least-squares
S. Steinerberger
ETNA - Electronic Transactions on Numerical Analysis, 2020-07-27
DOI: https://doi.org/10.1553/etna_vol54s610
Citations: 5
Abstract
We study the behavior of stochastic gradient descent applied to $\|Ax -b \|_2^2 \rightarrow \min$ for invertible $A \in \mathbb{R}^{n \times n}$, where $x$ denotes the solution of $Ax = b$. We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that $$ \mathbb{E} ~\left\| Ax_{k+1}-b\right\|^2_{2} \leq \left(1 + \frac{c_{A}}{\|A\|_F^2}\right) \left\|A x_k -b \right\|^2_{2} - \frac{2}{\|A\|_F^2} \left\|A^T A (x_k - x)\right\|^2_{2}.$$ This is a curious inequality: since $Ax_k - b = A(x_k - x)$, the last term has one more matrix applied to the residual $x_k - x$ than the remaining terms. If $x_k - x$ is composed mainly of singular vectors associated with large singular values, stochastic gradient descent leads to a quick regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values has a smoothing effect.
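The inequality can be checked numerically with a short sketch. The abstract does not specify the sampling scheme, so the code below makes two assumptions: stochastic gradient descent is taken in its randomized Kaczmarz form (row $a_i$ is chosen with probability $\|a_i\|^2/\|A\|_F^2$ and $x_k$ is projected onto the hyperplane $\langle a_i, x\rangle = b_i$), and the constant is taken to be $c_A = \max_i \|A a_i\|^2/\|a_i\|^2$, which is one value for which the bound holds under this sampling and may differ from the paper's constant. The helper names (`kaczmarz_step`, `expected_next_residual`, `upper_bound`) and the random test matrix are illustrative only.

```python
import numpy as np


def kaczmarz_step(A, b, x, i):
    """Project x onto the hyperplane <a_i, x> = b_i (assumed form of one SGD step)."""
    a = A[i]
    return x + (b[i] - a @ x) / (a @ a) * a


def expected_next_residual(A, b, x):
    """Exact E||A x_{k+1} - b||^2, averaging over rows with weights ||a_i||^2 / ||A||_F^2."""
    row_norms_sq = np.sum(A * A, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()
    return sum(
        p * np.linalg.norm(A @ kaczmarz_step(A, b, x, i) - b) ** 2
        for i, p in enumerate(probs)
    )


def upper_bound(A, b, x, x_true):
    """Right-hand side of the inequality with the assumed c_A = max_i ||A a_i||^2 / ||a_i||^2."""
    fro_sq = np.linalg.norm(A, "fro") ** 2
    row_norms_sq = np.sum(A * A, axis=1)
    # Column i of A @ A.T equals A a_i, so its squared norm is ||A a_i||^2.
    c_A = np.max(np.sum((A @ A.T) ** 2, axis=0) / row_norms_sq)
    residual_sq = np.linalg.norm(A @ x - b) ** 2
    smoothing_sq = np.linalg.norm(A.T @ (A @ (x - x_true))) ** 2
    return (1 + c_A / fro_sq) * residual_sq - 2 / fro_sq * smoothing_sq


rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))  # a Gaussian matrix is invertible with probability 1
x_true = rng.standard_normal(n)
b = A @ x_true
x = np.zeros(n)

# One-step check of the inequality at the starting point.
lhs = expected_next_residual(A, b, x)
rhs = upper_bound(A, b, x, x_true)

# Run a few hundred random steps and record the error ||x_k - x||.
probs = np.sum(A * A, axis=1) / np.linalg.norm(A, "fro") ** 2
errors = [np.linalg.norm(x - x_true)]
for _ in range(500):
    i = rng.choice(n, p=probs)
    x = kaczmarz_step(A, b, x, i)
    errors.append(np.linalg.norm(x - x_true))

print(f"E||A x_1 - b||^2 = {lhs:.4f}, bound = {rhs:.4f}")
print(f"error after 500 steps: {errors[-1]:.4e} (initial {errors[0]:.4e})")
```

Because each step under this assumption is an orthogonal projection onto a hyperplane containing $x$, the error $\|x_k - x\|$ never increases, whereas the residual $\|Ax_k - b\|$ is only controlled in expectation, which is what the inequality quantifies.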