On the regularization effect of stochastic gradient descent applied to least-squares

S. Steinerberger
{"title":"On the regularization effect of stochastic gradient descent applied to least-squares","authors":"S. Steinerberger","doi":"10.1553/etna_vol54s610","DOIUrl":null,"url":null,"abstract":"We study the behavior of stochastic gradient descent applied to $\\|Ax -b \\|_2^2 \\rightarrow \\min$ for invertible $A \\in \\mathbb{R}^{n \\times n}$. We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that $$ \\mathbb{E} ~\\left\\| Ax_{k+1}-b\\right\\|^2_{2} \\leq \\left(1 + \\frac{c_{A}}{\\|A\\|_F^2}\\right) \\left\\|A x_k -b \\right\\|^2_{2} - \\frac{2}{\\|A\\|_F^2} \\left\\|A^T A (x_k - x)\\right\\|^2_{2}.$$ This is a curious inequality: the last term has one more matrix applied to the residual $u_k - u$ than the remaining terms: if $x_k - x$ is mainly comprised of large singular vectors, stochastic gradient descent leads to a quick regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values smoothes.","PeriodicalId":282695,"journal":{"name":"ETNA - Electronic Transactions on Numerical Analysis","volume":"73 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ETNA - Electronic Transactions on Numerical Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1553/etna_vol54s610","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

We study the behavior of stochastic gradient descent applied to $\|Ax - b\|_2^2 \rightarrow \min$ for invertible $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that $$ \mathbb{E} ~\left\| Ax_{k+1}-b\right\|^2_{2} \leq \left(1 + \frac{c_{A}}{\|A\|_F^2}\right) \left\|A x_k -b \right\|^2_{2} - \frac{2}{\|A\|_F^2} \left\|A^T A (x_k - x)\right\|^2_{2}.$$ This is a curious inequality: the last term has one more matrix applied to the error $x_k - x$ than the remaining terms. If $x_k - x$ consists mainly of singular vectors associated with large singular values, stochastic gradient descent leads to quick regularization. For symmetric matrices, the inequality extends to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values has a smoothing effect.
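To make the setup concrete, the following is a minimal numerical sketch (not taken from the paper) of the iteration the inequality describes. It assumes that "stochastic gradient descent" here means the row-sampling / randomized Kaczmarz step in which row $a_i$ is drawn with probability $\|a_i\|^2 / \|A\|_F^2$; the matrix size, random seed, and iteration count are arbitrary illustration choices. It tracks the residual $\|Ax_k - b\|^2$ together with the "smoothed" quantity $\|A^T A (x_k - x)\|^2$ that appears on the right-hand side of the inequality.

```python
import numpy as np

# Minimal sketch: row-sampling SGD (randomized Kaczmarz) for ||Ax - b||^2 -> min.
# Assumption: rows are sampled with probability ||a_i||^2 / ||A||_F^2.

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))        # invertible with probability 1
x_true = rng.standard_normal(n)
b = A @ x_true

row_norms2 = np.sum(A**2, axis=1)
probs = row_norms2 / row_norms2.sum()  # ||a_i||^2 / ||A||_F^2

x = np.zeros(n)
for k in range(2001):
    i = rng.choice(n, p=probs)
    a_i = A[i]
    # Kaczmarz-type SGD step: project x onto the hyperplane <a_i, x> = b_i
    x = x - (a_i @ x - b[i]) / row_norms2[i] * a_i
    if k % 500 == 0:
        res = np.linalg.norm(A @ x - b) ** 2          # ||A x_k - b||^2
        smooth = np.linalg.norm(A.T @ (A @ (x - x_true))) ** 2  # ||A^T A (x_k - x)||^2
        print(f"k={k:5d}  ||Ax-b||^2={res:.3e}  ||A^T A(x-x*)||^2={smooth:.3e}")
```

Running the sketch shows the residual decaying quickly while the error concentrates in directions associated with small singular values, consistent with the energy-cascade picture described in the abstract.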