High-dimensional limit theorems for SGD: Effective dynamics and critical scaling

Impact factor 2.7 · Region 1 (Mathematics) · JCR Q1 (MATHEMATICS)

Communications on Pure and Applied Mathematics Pub Date : 2023-10-04 DOI:10.1002/cpa.22169

Gérard Ben Arous, Reza Gheissari, Aukosh Jagannath

Volume 77, Issue 3, pages 2030-2080. Full text: https://onlinelibrary.wiley.com/doi/10.1002/cpa.22169

Citations: 0

Abstract



We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on the former choices. We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss, but at which, a new correction term appears which changes the phase diagram. About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations. At the same time, we demonstrate the benefit of overparametrization by showing that the latter probability goes to zero as the second layer width grows.
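The idea of tracking a summary statistic of SGD under a dimension-dependent step-size can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a noisy linear regression model, a step-size of the form delta = c / n, and the single summary statistic R = ||theta - theta_star||^2. The function names (`run_sgd`, `limiting_ode`, `gradient_flow`) and all parameter values are hypothetical. For this toy model, a heuristic moment calculation suggests that in rescaled time t = step / n the statistic follows dR/dt = -2cR + c^2 (R + sigma^2), where the c^2 term is a correction that is absent from gradient flow on the population loss.

```python
import math
import random


def run_sgd(n, c, t_max, noise_std=0.5, seed=0):
    """Online SGD on a toy noisy linear regression with step size c / n.

    Returns (t, R) pairs, where t = step / n is rescaled time and
    R = ||theta - theta_star||^2 is the tracked summary statistic.
    The model and the c / n scaling are assumptions of this sketch.
    """
    rng = random.Random(seed)
    theta_star = [1.0 / math.sqrt(n)] * n  # unit-norm target (assumed)
    theta = [0.0] * n                      # start at the origin, so R(0) = 1
    delta = c / n
    history = [(0.0, 1.0)]
    for k in range(1, int(t_max * n) + 1):
        # Draw one fresh sample (x, y) per step: online SGD.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = sum(a * b for a, b in zip(x, theta_star)) + noise_std * rng.gauss(0.0, 1.0)
        residual = sum(a * b for a, b in zip(x, theta)) - y
        # Gradient of the squared loss 0.5 * residual^2 is residual * x.
        theta = [t - delta * residual * a for t, a in zip(theta, x)]
        if k % n == 0:
            r = sum((t - s) ** 2 for t, s in zip(theta, theta_star))
            history.append((k / n, r))
    return history


def limiting_ode(c, t, noise_std=0.5, r0=1.0):
    """Closed-form solution of dR/dt = -2cR + c^2 (R + sigma^2), for 0 < c < 2."""
    rate = 2.0 * c - c * c
    r_inf = c * c * noise_std ** 2 / rate
    return r_inf + (r0 - r_inf) * math.exp(-rate * t)


def gradient_flow(c, t, r0=1.0):
    """Solution of dR/dt = -2cR: gradient flow on the population loss."""
    return r0 * math.exp(-2.0 * c * t)


if __name__ == "__main__":
    for t, r in run_sgd(n=200, c=0.5, t_max=4.0):
        print(f"t={t:.0f}  sgd={r:.4f}  ode={limiting_ode(0.5, t):.4f}  "
              f"flow={gradient_flow(0.5, t):.4f}")
```

In this toy setting the simulated R should stay close to the corrected ODE and plateau above zero, whereas the gradient-flow curve decays to zero. Choosing a much smaller c makes the c^2 term negligible relative to the drift, which is the sense in which a sub-critical step-size recovers gradient flow here.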


Source journal

Communications on Pure and Applied Mathematics (Mathematics-Mathematics)

CiteScore: 6.70
Self-citation rate: 3.30%
Publication count:
Review time: >12 weeks

Journal description: Communications on Pure and Applied Mathematics (ISSN 0010-3640) is published monthly, one volume per year, by John Wiley & Sons, Inc. © 2019. The journal primarily publishes papers originating at or solicited by the Courant Institute of Mathematical Sciences. It features recent developments in applied mathematics, mathematical physics, and mathematical analysis. The topics include partial differential equations, computer science, and applied mathematics. CPAM is devoted to mathematical contributions to the sciences; both theoretical and applied papers, of original or expository type, are included.