Clemens Karner, Vladimir Kazeev, Philipp Christian Petersen
{"title":"反向传播的数值不稳定性导致神经网络训练的局限性","authors":"Clemens Karner, Vladimir Kazeev, Philipp Christian Petersen","doi":"10.1007/s10444-024-10106-x","DOIUrl":null,"url":null,"abstract":"<div><p>We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate that it is <i>highly unlikely</i> to find ReLU neural networks that maintain, in the course of training with gradient descent, <i>superlinearly</i> many affine pieces with respect to their number of layers. In virtually all approximation theoretical arguments which yield high order polynomial rates of approximation, sequences of ReLU neural networks with <i>exponentially</i> many affine pieces compared to their numbers of layers are used. As a consequence, we conclude that approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences. The assumptions and the theoretical results are compared to a numerical study, which yields concurring results.</p></div>","PeriodicalId":50869,"journal":{"name":"Advances in Computational Mathematics","volume":"50 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10444-024-10106-x.pdf","citationCount":"0","resultStr":"{\"title\":\"Limitations of neural network training due to numerical instability of backpropagation\",\"authors\":\"Clemens Karner, Vladimir Kazeev, Philipp Christian Petersen\",\"doi\":\"10.1007/s10444-024-10106-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate that it is <i>highly unlikely</i> to find ReLU neural networks that maintain, in the course of training with gradient descent, <i>superlinearly</i> many affine pieces with respect to their number of layers. In virtually all approximation theoretical arguments which yield high order polynomial rates of approximation, sequences of ReLU neural networks with <i>exponentially</i> many affine pieces compared to their numbers of layers are used. As a consequence, we conclude that approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences. 
The assumptions and the theoretical results are compared to a numerical study, which yields concurring results.</p></div>\",\"PeriodicalId\":50869,\"journal\":{\"name\":\"Advances in Computational Mathematics\",\"volume\":\"50 1\",\"pages\":\"\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2024-02-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://link.springer.com/content/pdf/10.1007/s10444-024-10106-x.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Advances in Computational Mathematics\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10444-024-10106-x\",\"RegionNum\":3,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICS, APPLIED\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Computational Mathematics","FirstCategoryId":"100","ListUrlMain":"https://link.springer.com/article/10.1007/s10444-024-10106-x","RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}
Limitations of neural network training due to numerical instability of backpropagation
We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate that it is highly unlikely to find ReLU neural networks that maintain, in the course of training with gradient descent, superlinearly many affine pieces with respect to their number of layers. Virtually all approximation-theoretic arguments that yield high-order polynomial approximation rates rely on sequences of ReLU neural networks with exponentially many affine pieces relative to their number of layers. As a consequence, we conclude that the approximating sequences of ReLU neural networks obtained by gradient descent in practice differ substantially from the theoretically constructed sequences. The assumptions and the theoretical results are compared to a numerical study, which yields concurring results.
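To make the notion of "exponentially many affine pieces relative to depth" concrete, the sketch below (not taken from the paper; function and variable names are illustrative assumptions) builds the classic sawtooth network obtained by composing a two-piece ReLU "hat" function with itself, and counts its affine pieces numerically. Each additional layer doubles the piece count, which is exactly the regime of theoretical constructions that the abstract argues gradient descent is unlikely to reach.

```python
# A minimal, illustrative sketch (not the paper's construction): the standard
# "sawtooth" ReLU network, whose number of affine pieces grows exponentially
# with its depth. Names such as `sawtooth` and `count_affine_pieces` are
# assumptions made for this example.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sawtooth(x, depth):
    """Compose the hat function T(x) = 2*relu(x) - 4*relu(x - 0.5) with itself
    `depth` times. On [0, 1] the result is piecewise affine with 2**depth pieces."""
    for _ in range(depth):
        x = 2.0 * relu(x) - 4.0 * relu(x - 0.5)
    return x

def count_affine_pieces(f, depth):
    """Count the affine pieces of f on [0, 1] by detecting slope changes on a
    dyadic grid fine enough to contain every breakpoint of the depth-`depth` sawtooth."""
    xs = np.linspace(0.0, 1.0, 2**depth * 4 + 1)
    slopes = np.diff(f(xs)) / np.diff(xs)
    return int(np.sum(~np.isclose(slopes[1:], slopes[:-1]))) + 1

for depth in range(1, 8):
    pieces = count_affine_pieces(lambda x: sawtooth(x, depth), depth)
    print(f"depth {depth}: {pieces} affine pieces")  # 2, 4, 8, ..., 128
```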
About the journal:
Advances in Computational Mathematics publishes high-quality, accessible, and original articles at the forefront of computational and applied mathematics, with a clear potential for impact across the sciences. The journal emphasizes three core areas: approximation theory and computational geometry; numerical analysis, modelling and simulation; and imaging, signal processing and data analysis.
This journal welcomes papers that are accessible to a broad audience in the mathematical sciences and that show either an advance in computational methodology or a novel scientific application area, or both. Methods papers should rely on rigorous analysis and/or convincing numerical studies.