Generalization Guarantees of Gradient Descent for Shallow Neural Networks

IF 2.1; Zone 4, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neural Computation Pub Date : 2025-01-21 DOI:10.1162/neco_a_01725

Puyu Wang;Yunwen Lei;Di Wang;Yiming Ying;Ding-Xuan Zhou

Neural Computation, vol. 37, no. 2, pp. 344-402.

Citations: 0

Abstract

Significant progress has been made recently in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling. Here, network scaling corresponds to the normalization of the layers. In this article, we greatly extend the previous work (Lei et al., 2022; Richards & Kuzborskij, 2021) by conducting a comprehensive stability and generalization analysis of GD for two-layer and three-layer NNs. For two-layer NNs, our results are established under general network scaling, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating its nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of overparameterization. As a direct application of our general findings, we derive the excess risk rate of O(1/√n) for GD in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for underparameterized and overparameterized NNs trained by GD to attain the desired risk rate of O(1/√n). Moreover, we demonstrate that as the scaling factor increases or the network complexity decreases, less overparameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of O(1/n) for GD in both two-layer and three-layer NNs.
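To make "network scaling" concrete, here is a minimal sketch of full-batch GD on a two-layer network whose output is normalized by m^(-c), where m is the hidden width. The abstract does not state the parameterization, so everything specific here is an assumption made for illustration: the form f_W(x) = m^(-c) Σ_k a_k tanh(⟨w_k, x⟩), the fixed ±1 output weights a_k, the tanh activation, the squared loss, the toy data, and all function names (`init_params`, `make_data`, `predict`, `empirical_risk`, `gd_step`).

```python
import math
import random


def init_params(m, d, seed=0):
    """Draw hidden weights W (m x d) and fixed output signs a (length m)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    a = [rng.choice((-1.0, 1.0)) for _ in range(m)]  # assumed fixed, not trained
    return W, a


def make_data(n, d, seed=1):
    """Toy regression sample: inputs in [-1, 1]^d, target sin of the first coordinate."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [(x, math.sin(x[0])) for x in xs]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def predict(W, a, x, c):
    """Assumed model: f_W(x) = m^(-c) * sum_k a_k * tanh(<w_k, x>)."""
    scale = len(W) ** (-c)  # the network scaling factor
    return scale * sum(a_k * math.tanh(dot(w_k, x)) for w_k, a_k in zip(W, a))


def empirical_risk(W, a, data, c):
    """Average squared loss over the sample (with the conventional factor 1/2)."""
    return sum((predict(W, a, x, c) - y) ** 2 for x, y in data) / (2 * len(data))


def gd_step(W, a, data, c, eta):
    """One full-batch gradient descent step on the hidden weights only."""
    m, n = len(W), len(data)
    scale = m ** (-c)
    grads = [[0.0] * len(W[0]) for _ in range(m)]
    for x, y in data:
        residual = predict(W, a, x, c) - y
        for k in range(m):
            t = math.tanh(dot(W[k], x))
            # Chain rule: d/dw_k of the loss, using tanh'(z) = 1 - tanh(z)^2.
            coef = residual * scale * a[k] * (1.0 - t * t) / n
            for j, xj in enumerate(x):
                grads[k][j] += coef * xj
    return [[w - eta * g for w, g in zip(w_k, g_k)] for w_k, g_k in zip(W, grads)]


if __name__ == "__main__":
    W0, a = init_params(m=16, d=2)
    data = make_data(n=8, d=2)
    for c in (0.5, 1.0):  # two illustrative scaling exponents
        W = W0
        for _ in range(50):
            W = gd_step(W, a, data, c, eta=0.05)
        print(c, empirical_risk(W0, a, data, c), empirical_risk(W, a, data, c))
```

Running the script prints the training risk before and after 50 GD steps for two choices of the exponent c. It only illustrates how the scaling factor enters the model and its gradient; it says nothing about the stability or excess-risk bounds themselves.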


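Source journal

Neural Computation (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 6.30

Self-citation rate: 3.40%

Articles published:

Review time: 3.0 months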

About the journal: Neural Computation is uniquely positioned at the crossroads between neuroscience and TMCS and welcomes the submission of original papers from all areas of TMCS, including: Advanced experimental design; Analysis of chemical sensor data; Connectomic reconstructions; Analysis of multielectrode and optical recordings; Genetic data for cell identity; Analysis of behavioral data; Multiscale models; Analysis of molecular mechanisms; Neuroinformatics; Analysis of brain imaging data; Neuromorphic engineering; Principles of neural coding, computation, circuit dynamics, and plasticity; Theories of brain function.