Theory II: Deep learning and optimization

IF 1.2 · Zone 4, Engineering & Technology · Q3 ENGINEERING, MULTIDISCIPLINARY

Bulletin of the Polish Academy of Sciences-Technical Sciences · Pub Date: 2023-11-06 · DOI: 10.24425/BPAS.2018.125925

T. Poggio, Q. Liao


Citations: 8

Abstract


Theory II: Deep learning and optimization

Bull. Pol. Ac.: Tech. 66(6) 2018

The landscape of the empirical risk of overparametrized deep convolutional neural networks (DCNNs) is characterized with a mix of theory and experiments. In part A, we show the existence of a large number of global minimizers with zero empirical error (modulo inconsistent equations). The argument, which relies on Bézout's theorem, is rigorous when the ReLUs are replaced by a polynomial nonlinearity. We show with simulations that the corresponding polynomial network is indistinguishable from the ReLU network. According to Bézout's theorem, the global minimizers are degenerate, unlike the local minima, which in general should be non-degenerate. We further analyze and visualize experimentally the landscape of the empirical risk of DCNNs on the CIFAR-10 dataset. Based on these theoretical and experimental observations, we propose a simple model of the landscape of empirical risk. In part B, we characterize the optimization properties of stochastic gradient descent (SGD) applied to deep networks. The main claim consists of theoretical and experimental evidence for the following property of SGD: like the classical Langevin equation, SGD concentrates in probability on large-volume, "flat" minima, selecting with high probability degenerate minimizers, which are typically global minimizers.
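The part B claim can be illustrated with a toy simulation. The sketch below is not the authors' experiment: it assumes a hypothetical one-dimensional loss with two minima of equal depth, one wide and one narrow, and uses an Euler-Maruyama discretization of overdamped Langevin dynamics (gradient descent plus isotropic Gaussian noise) as a stand-in for SGD noise. The curvatures, temperature, step size, and the function names `grad`, `in_sharp_basin`, and `langevin_wide_fraction` are all illustrative assumptions.

```python
import numpy as np

# Hypothetical 1-D "empirical risk": two minima of equal depth (loss 0),
# a wide one at x = -1 and a narrow one at x = +1.
WIDE_CURVATURE = 1.0
SHARP_CURVATURE = 25.0


def in_sharp_basin(x):
    """True where the narrow parabola is the lower of the two branches."""
    return SHARP_CURVATURE * (x - 1.0) ** 2 < WIDE_CURVATURE * (x + 1.0) ** 2


def grad(x):
    """Gradient of loss(x) = min(wide parabola, narrow parabola)."""
    return np.where(
        in_sharp_basin(x),
        2.0 * SHARP_CURVATURE * (x - 1.0),
        2.0 * WIDE_CURVATURE * (x + 1.0),
    )


def langevin_wide_fraction(temperature, n_chains=2000, n_steps=4000,
                           step_size=0.005, seed=0):
    """Run Euler-Maruyama Langevin chains started at the narrow minimum.

    Returns the fraction of chains that end in the wide basin.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_chains, 1.0)  # every chain starts in the narrow minimum
    noise_scale = np.sqrt(2.0 * step_size * temperature)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        x = x - step_size * grad(x) + noise_scale * noise
    return float(np.mean(~in_sharp_basin(x)))


if __name__ == "__main__":
    print("noiseless gradient descent:", langevin_wide_fraction(0.0))
    print("Langevin, temperature 1.0: ", langevin_wide_fraction(1.0))
```

With zero temperature the chains never leave the narrow minimum they start in. With noise, most chains should end in the wide basin even though both minima have the same loss: under the Gibbs density proportional to exp(-loss/T), a Gaussian approximation gives the wide basin roughly sqrt(25/1) = 5 times the probability mass of the narrow one in this toy setup.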


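Source journal

Bulletin of the Polish Academy of Sciences-Technical Sciences (Engineering & Technology: Engineering, Multidisciplinary)

CiteScore: 2.80
Self-citation rate: 16.70%
Articles published: not listed
Review time: 6-12 weeks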

About the journal: The Bulletin of the Polish Academy of Sciences: Technical Sciences has been published bimonthly by Division IV (Engineering Sciences) of the Polish Academy of Sciences since the PAS was founded in 1952. The journal is peer-reviewed and appears in both print and electronic form. It publishes original, high-quality papers from the multidisciplinary engineering sciences, with the following topics preferred: Artificial and Computational Intelligence, Biomedical Engineering and Biotechnology, Civil Engineering, Control, Informatics and Robotics, Electronics, Telecommunication and Optoelectronics, Mechanical and Aeronautical Engineering, Thermodynamics, Material Science and Nanotechnology, Power Systems and Power Electronics.