{"title":"Deep learning: a statistical viewpoint","authors":"P. Bartlett, A. Montanari, A. Rakhlin","doi":"10.1017/S0962492921000027","DOIUrl":null,"url":null,"abstract":"The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.","PeriodicalId":48863,"journal":{"name":"Acta Numerica","volume":"30 1","pages":"87 - 201"},"PeriodicalIF":16.3000,"publicationDate":"2021-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1017/S0962492921000027","citationCount":"177","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Acta Numerica","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1017/S0962492921000027","RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICS","Score":null,"Total":0}
Citations: 177
Abstract
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
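The abstract's claim that gradient methods implicitly regularize, converging to minimum-norm functions that perfectly fit the training data, can already be observed in overparametrized linear least squares. The sketch below is our own illustration rather than code from the paper; the dimensions, learning rate and iteration count are arbitrary choices. Gradient descent initialized at zero stays in the row space of the data matrix, so it converges to the minimum-l2-norm interpolating solution, which we check against the pseudoinverse solution.

```python
# Minimal sketch (illustrative, not from the paper): implicit regularization of
# gradient descent on an overparametrized linear regression problem.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # n samples, d > n parameters (overparametrized)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on the squared loss (1/2n)||Xw - y||^2, initialized at zero.
w = np.zeros(d)
lr = 0.01                            # small enough for the spectrum of X^T X / n
for _ in range(100_000):
    w -= lr * X.T @ (X @ w - y) / n

# Minimum-norm interpolant, computed directly via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training error:", np.linalg.norm(X @ w - y))                       # ~0: perfect fit
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))   # ~0
```

Initializing at zero is what pins down the minimum-norm solution: starting from a different point, gradient descent would instead converge to the interpolant closest to that starting point.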
Journal Introduction:
Acta Numerica stands as the preeminent mathematics journal, ranking highest in both Impact Factor and MCQ metrics. This annual journal publishes survey papers authored by prominent researchers in numerical analysis, scientific computing and computational mathematics. These papers deliver comprehensive overviews of recent advances, offering state-of-the-art techniques and analyses.
Encompassing the entirety of numerical analysis, the articles are crafted in an accessible style, catering to researchers at all levels and serving as valuable teaching aids for advanced instruction. The broad subject areas covered include computational methods in linear algebra, optimization, ordinary and partial differential equations, approximation theory, stochastic analysis, nonlinear dynamical systems, as well as the application of computational techniques in science and engineering. Acta Numerica also delves into the mathematical theory underpinning numerical methods, making it a versatile and authoritative resource in the field of mathematics.