Polynomial-time universality and limitations of deep learning

IF 3.1 | CAS Tier 1 (Mathematics) | JCR Q1 MATHEMATICS
Emmanuel Abbe, Colin Sandon
{"title":"Polynomial-time universality and limitations of deep learning","authors":"Emmanuel Abbe,&nbsp;Colin Sandon","doi":"10.1002/cpa.22121","DOIUrl":null,"url":null,"abstract":"<p>The goal of this paper is to characterize function distributions that general neural networks trained by descent algorithms (GD/SGD), can or cannot learn in polytime. The results are: (1) The paradigm of general neural networks trained by SGD is poly-time universal: any function distribution that can be learned from samples in polytime can also be learned by a poly-size neural net trained by SGD with polynomial parameters. In particular, this can be achieved despite polynomial noise on the gradients, implying a separation result between SGD-based deep learning and statistical query algorithms, as the latter are not comparably universal due to cases like parities. This also shows that deep learning does not suffer from the limitations of shallow networks. (2) The paper further gives a lower-bound on the generalization error of descent algorithms, which relies on two quantities: the cross-predictability, an average-case quantity related to the statistical dimension, and the null-flow, a quantity specific to descent algorithms. The lower-bound implies in particular that for functions of low enough cross-predictability, the above robust universality breaks down once the gradients are averaged over too many samples (as in perfect GD) rather than fewer (as in SGD). (3) Finally, it is shown that if larger amounts of noise are added on the initialization and on the gradients, then SGD is no longer comparably universal due again to distributions having low enough cross-predictability.</p>","PeriodicalId":10601,"journal":{"name":"Communications on Pure and Applied Mathematics","volume":null,"pages":null},"PeriodicalIF":3.1000,"publicationDate":"2023-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cpa.22121","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Communications on Pure and Applied Mathematics","FirstCategoryId":"100","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cpa.22121","RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICS","Score":null,"Total":0}
Citations: 0

Abstract

The goal of this paper is to characterize function distributions that general neural networks trained by descent algorithms (GD/SGD) can or cannot learn in polytime. The results are: (1) The paradigm of general neural networks trained by SGD is poly-time universal: any function distribution that can be learned from samples in polytime can also be learned by a poly-size neural net trained by SGD with polynomial parameters. In particular, this can be achieved despite polynomial noise on the gradients, implying a separation result between SGD-based deep learning and statistical query algorithms, as the latter are not comparably universal due to cases like parities. This also shows that deep learning does not suffer from the limitations of shallow networks. (2) The paper further gives a lower bound on the generalization error of descent algorithms, which relies on two quantities: the cross-predictability, an average-case quantity related to the statistical dimension, and the null-flow, a quantity specific to descent algorithms. The lower bound implies in particular that for functions of low enough cross-predictability, the above robust universality breaks down once the gradients are averaged over too many samples (as in perfect GD) rather than fewer (as in SGD). (3) Finally, it is shown that if larger amounts of noise are added on the initialization and on the gradients, then SGD is no longer comparably universal due again to distributions having low enough cross-predictability.
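To make the role of parities and cross-predictability more concrete, here is a minimal numpy sketch that Monte Carlo estimates the cross-predictability of the uniform distribution over parity functions on n bits, assuming a definition of the form Pred = E_{f,f'} [ E_x f(x) f'(x) ]^2 with f, f' drawn independently from the function distribution and x uniform on {0,1}^n (the paper's exact definition and normalization may differ). Distinct parities are uncorrelated under the uniform input distribution, so the estimate should come out close to 2^{-n}, which is why random parities are the canonical low-cross-predictability (and SQ-hard) example mentioned in the abstract. Function names and parameters below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_parity(n, rng):
    # Draw a uniformly random subset S of {1,...,n} (as a boolean mask) and
    # return the corresponding parity function chi_S : {0,1}^n -> {-1,+1}.
    S = rng.integers(0, 2, size=n).astype(bool)
    return lambda x: np.where(x[:, S].sum(axis=1) % 2 == 0, 1.0, -1.0)

def estimate_cross_predictability(n, n_pairs=2000, n_inputs=4096, rng=rng):
    # Monte Carlo estimate of E_{f,f'}[ (E_x f(x) f'(x))^2 ] for f, f' i.i.d.
    # uniform parities and x uniform on {0,1}^n (assumed form of the quantity).
    total = 0.0
    for _ in range(n_pairs):
        f, g = sample_parity(n, rng), sample_parity(n, rng)
        x = rng.integers(0, 2, size=(n_inputs, n))
        corr = np.mean(f(x) * g(x))  # empirical correlation E_x f(x) g(x)
        total += corr ** 2
    # Expected value is about 2^{-n}, up to O(1/n_inputs) Monte Carlo bias.
    return total / n_pairs

for n in (4, 6, 8):
    print(f"n={n}: estimated cross-predictability {estimate_cross_predictability(n):.4f} "
          f"vs 2^-n = {2 ** -n:.4f}")
```

Result (2) of the abstract says, roughly, that descent algorithms whose gradients are averaged over many samples cannot learn function families with sufficiently low cross-predictability in polynomial time; the sketch above only illustrates the quantity itself, not the paper's lower-bound argument or its network constructions.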


Source journal
CiteScore: 6.70
Self-citation rate: 3.30%
Articles published: 59
Review time: >12 weeks
Journal description: Communications on Pure and Applied Mathematics (ISSN 0010-3640) is published monthly, one volume per year, by John Wiley & Sons, Inc. The journal primarily publishes papers originating at or solicited by the Courant Institute of Mathematical Sciences. It features recent developments in applied mathematics, mathematical physics, and mathematical analysis; topics include partial differential equations, computer science, and applied mathematics. CPAM is devoted to mathematical contributions to the sciences; both theoretical and applied papers, of original or expository type, are included.