SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION.

IF 3.2 | Zone 1, Mathematics | JCR Q1, Statistics & Probability

Annals of Statistics. Pub Date: 2022-04-01. Epub Date: 2022-04-07. DOI: 10.1214/21-aos2133

Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J Tibshirani

Annals of Statistics, pp. 949-986. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9481183/pdf/nihms-1830540.pdf

Citations: 579

Abstract

Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters $p$ is of the same order as the number of samples $n$. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
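The setup in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it computes the minimum $\ell_2$ norm least squares solution with the pseudoinverse and generates features from both models. Everything not stated in the abstract is an assumption made for illustration: Gaussian entries, $\Sigma = I$, a tanh activation, a unit-norm signal, noise level 1, and the helper names `min_norm_fit`, `linear_features`, `nonlinear_features`, and `excess_risk`.

```python
import numpy as np

def min_norm_fit(X, y):
    # pinv(X) @ y is the least squares solution of smallest l2 norm.
    # When p > n and X has full row rank, it interpolates the training data.
    return np.linalg.pinv(X) @ y

def linear_features(n, p, rng):
    # Linear model x_i = Sigma^{1/2} z_i, with the assumed choice Sigma = I
    # and i.i.d. standard Gaussian entries for z_i.
    return rng.standard_normal((n, p))

def nonlinear_features(n, p, d, rng, phi=np.tanh):
    # Nonlinear model x_i = phi(W z_i), z_i in R^d, W in R^{p x d}.
    # The Gaussian W, its 1/sqrt(d) scaling, and tanh are assumptions.
    Z = rng.standard_normal((n, d))
    W = rng.standard_normal((p, d)) / np.sqrt(d)
    return phi(Z @ W.T)

def excess_risk(n, p, rng, sigma=1.0, trials=20):
    # Monte Carlo average of ||beta_hat - beta||^2, which equals the
    # excess prediction risk when Sigma = I.
    total = 0.0
    for _ in range(trials):
        beta = rng.standard_normal(p)
        beta /= np.linalg.norm(beta)          # assumed unit signal norm
        X = linear_features(n, p, rng)
        y = X @ beta + sigma * rng.standard_normal(n)
        total += np.sum((min_norm_fit(X, y) - beta) ** 2)
    return total / trials

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100
    # Sweep p below, at, and above the interpolation threshold p = n.
    for p in (50, 100, 300):
        print(p, excess_risk(n, p, rng))
```

Under these assumptions the estimated risk should peak near $p = n$ and fall again as $p$ grows past $n$, which is the "double descent" shape the abstract refers to. The exact values depend on the seed and the number of trials.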

Source journal

Annals of Statistics (Mathematics: Statistics & Probability)

CiteScore: 9.30
Self-citation rate: 8.90%
Articles published: 119
Review time: 6-12 weeks

Journal description: The Annals of Statistics aims to publish research papers of the highest quality reflecting the many facets of contemporary statistics. Primary emphasis is placed on importance and originality, not on formalism. The journal aims to cover all areas of statistics, especially mathematical statistics and applied & interdisciplinary statistics. Of course, many of the best papers will touch on more than one of these general areas, because the discipline of statistics has deep roots in mathematics and in substantive scientific fields.