Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J Tibshirani
{"title":"高维无脊最小二乘插值中的奇异值。","authors":"Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J Tibshirani","doi":"10.1214/21-aos2133","DOIUrl":null,"url":null,"abstract":"<p><p>Interpolators-estimators that achieve zero training error-have attracted growing attention in machine learning, mainly because state-of-the art neural networks appear to be models of this type. In this paper, we study minimum <i>ℓ</i> <sub>2</sub> norm (\"ridgeless\") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters <i>p</i> is of the same order as the number of samples <i>n</i>. We consider two different models for the feature distribution: a linear model, where the feature vectors <math> <mrow><msub><mi>x</mi> <mi>i</mi></msub> <mo>∈</mo> <msup><mi>ℝ</mi> <mi>p</mi></msup> </mrow> </math> are obtained by applying a linear transform to a vector of i.i.d. entries, <i>x</i> <sub><i>i</i></sub> = Σ<sup>1/2</sup> <i>z</i> <sub><i>i</i></sub> (with <math> <mrow><msub><mi>z</mi> <mi>i</mi></msub> <mo>∈</mo> <msup><mi>ℝ</mi> <mi>p</mi></msup> </mrow> </math> ); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, <i>x<sub>i</sub></i> = <i>φ</i>(<i>Wz</i> <sub><i>i</i></sub> ) (with <math> <mrow><msub><mi>z</mi> <mi>i</mi></msub> <mo>∈</mo> <msup><mi>ℝ</mi> <mi>d</mi></msup> </mrow> </math> , <math><mrow><mi>W</mi> <mo>∈</mo> <msup><mi>ℝ</mi> <mrow><mi>p</mi> <mo>×</mo> <mi>d</mi></mrow> </msup> </mrow> </math> a matrix of i.i.d. entries, and <i>φ</i> an activation function acting componentwise on <i>Wz</i> <sub><i>i</i></sub> ). We recover-in a precise quantitative way-several phenomena that have been observed in large-scale neural networks and kernel machines, including the \"double descent\" behavior of the prediction risk, and the potential benefits of overparametrization.</p>","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2022-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9481183/pdf/nihms-1830540.pdf","citationCount":"579","resultStr":"{\"title\":\"SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION.\",\"authors\":\"Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J Tibshirani\",\"doi\":\"10.1214/21-aos2133\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Interpolators-estimators that achieve zero training error-have attracted growing attention in machine learning, mainly because state-of-the art neural networks appear to be models of this type. In this paper, we study minimum <i>ℓ</i> <sub>2</sub> norm (\\\"ridgeless\\\") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters <i>p</i> is of the same order as the number of samples <i>n</i>. We consider two different models for the feature distribution: a linear model, where the feature vectors <math> <mrow><msub><mi>x</mi> <mi>i</mi></msub> <mo>∈</mo> <msup><mi>ℝ</mi> <mi>p</mi></msup> </mrow> </math> are obtained by applying a linear transform to a vector of i.i.d. 
entries, <i>x</i> <sub><i>i</i></sub> = Σ<sup>1/2</sup> <i>z</i> <sub><i>i</i></sub> (with <math> <mrow><msub><mi>z</mi> <mi>i</mi></msub> <mo>∈</mo> <msup><mi>ℝ</mi> <mi>p</mi></msup> </mrow> </math> ); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, <i>x<sub>i</sub></i> = <i>φ</i>(<i>Wz</i> <sub><i>i</i></sub> ) (with <math> <mrow><msub><mi>z</mi> <mi>i</mi></msub> <mo>∈</mo> <msup><mi>ℝ</mi> <mi>d</mi></msup> </mrow> </math> , <math><mrow><mi>W</mi> <mo>∈</mo> <msup><mi>ℝ</mi> <mrow><mi>p</mi> <mo>×</mo> <mi>d</mi></mrow> </msup> </mrow> </math> a matrix of i.i.d. entries, and <i>φ</i> an activation function acting componentwise on <i>Wz</i> <sub><i>i</i></sub> ). We recover-in a precise quantitative way-several phenomena that have been observed in large-scale neural networks and kernel machines, including the \\\"double descent\\\" behavior of the prediction risk, and the potential benefits of overparametrization.</p>\",\"PeriodicalId\":3,\"journal\":{\"name\":\"ACS Applied Electronic Materials\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2022-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9481183/pdf/nihms-1830540.pdf\",\"citationCount\":\"579\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACS Applied Electronic Materials\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1214/21-aos2133\",\"RegionNum\":3,\"RegionCategory\":\"材料科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2022/4/7 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1214/21-aos2133","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2022/4/7 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 579
SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION.

Abstract
Interpolators, estimators that achieve zero training error, have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2-norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters p is of the same order as the number of samples n. We consider two different models for the feature distribution: a linear model, where the feature vectors x_i ∈ ℝ^p are obtained by applying a linear transform to a vector of i.i.d. entries, x_i = Σ^{1/2} z_i (with z_i ∈ ℝ^p); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, x_i = φ(W z_i) (with z_i ∈ ℝ^d, W ∈ ℝ^{p×d} a matrix of i.i.d. entries, and φ an activation function acting componentwise on W z_i). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
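
As a concrete illustration of the setting described in the abstract (not code from the paper), the minimal Python sketch below fits the minimum ℓ2-norm least squares interpolator via the pseudoinverse under an isotropic instance of the linear feature model (Σ = I) and reports a Monte Carlo estimate of the excess prediction risk as the aspect ratio γ = p/n crosses 1; the spike near γ = 1 is the "double descent" peak. All constants (n, noise level, signal strength, the grid of γ values) are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    def ridgeless_risk(n, p, sigma2=1.0, r2=1.0, n_test=2000):
        # Monte Carlo estimate of the excess prediction risk of the
        # minimum l2-norm ("ridgeless") least squares interpolator.
        beta = rng.normal(size=p)
        beta *= np.sqrt(r2) / np.linalg.norm(beta)   # fix signal strength ||beta||^2 = r2
        X = rng.normal(size=(n, p))                  # isotropic Gaussian features (Sigma = I)
        y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
        beta_hat = np.linalg.pinv(X) @ y             # min-norm least squares solution
        X_test = rng.normal(size=(n_test, p))
        return np.mean((X_test @ (beta_hat - beta)) ** 2)

    n = 200
    for gamma in [0.2, 0.5, 0.8, 0.95, 1.05, 1.2, 2.0, 5.0]:
        p = int(gamma * n)
        risks = [ridgeless_risk(n, p) for _ in range(5)]
        print(f"gamma = p/n = {gamma:4.2f}: estimated risk {np.mean(risks):8.2f}")

In the overparametrized range (γ > 1) the pseudoinverse solution interpolates the training data exactly, yet its risk decreases again as γ grows, which is the qualitative behavior the paper characterizes precisely.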