Understanding the training of infinitely deep and wide ResNets with conditional optimal transport

IF 3.1 · Zone 1 (Mathematics) · Q1 MATHEMATICS
Raphaël Barboni, Gabriel Peyré, François‐Xavier Vialard
{"title":"Understanding the training of infinitely deep and wide ResNets with conditional optimal transport","authors":"Raphaël Barboni, Gabriel Peyré, François‐Xavier Vialard","doi":"10.1002/cpa.70004","DOIUrl":null,"url":null,"abstract":"We study the convergence of gradient flow for the training of deep neural networks. While residual neural networks (ResNet) are a popular example of very deep architectures, their training constitutes a challenging optimization problem, notably due to the non‐convexity and the non‐coercivity of the objective. Yet, in applications, such tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a “mean‐field” model of an infinitely deep and arbitrarily wide ResNet, parameterized by probability measures on the product set of layers and parameters, and with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean field models have been proven to benefit from simplified loss landscapes and good theoretical guarantees when trained with gradient flow w.r.t. the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional optimal transport (COT) distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces, we first show the well‐posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak–Łojasiewicz analysis, we then show convergence of the gradient flow for well‐chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges to a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets. In addition, this work is an opportunity to study in more detail the COT metric, particularly its dynamic formulation. Some of our results in this direction might be interesting on their own.","PeriodicalId":10601,"journal":{"name":"Communications on Pure and Applied Mathematics","volume":"29 1","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Communications on Pure and Applied Mathematics","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1002/cpa.70004","RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICS","Score":null,"Total":0}
引用次数: 0

Abstract

We study the convergence of gradient flow for the training of deep neural networks. While residual neural networks (ResNet) are a popular example of very deep architectures, their training constitutes a challenging optimization problem, notably due to the non‐convexity and the non‐coercivity of the objective. Yet, in applications, such tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a “mean‐field” model of an infinitely deep and arbitrarily wide ResNet, parameterized by probability measures on the product set of layers and parameters, and with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean field models have been proven to benefit from simplified loss landscapes and good theoretical guarantees when trained with gradient flow w.r.t. the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional optimal transport (COT) distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces, we first show the well‐posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak–Łojasiewicz analysis, we then show convergence of the gradient flow for well‐chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges to a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets. In addition, this work is an opportunity to study in more detail the COT metric, particularly its dynamic formulation. Some of our results in this direction might be interesting on their own.
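As a rough guide to the objects the abstract refers to, the sketch below follows the standard mean-field ResNet formulation from the related literature; the specific notation (the residual map f, the state X_s, the conditional measures \mu_s) is an illustrative assumption rather than the paper's exact setup. The infinitely deep, arbitrarily wide ResNet is modeled as a continuous-depth dynamics driven by a probability measure \mu on the product of layers [0,1] and parameters \Theta, with uniform marginal on the layer variable, and one common way of writing the conditional optimal transport distance is fiber-wise:

\[
\dot X_s \;=\; \int_{\Theta} f(X_s,\theta)\,\mathrm{d}\mu_s(\theta), \qquad s\in[0,1],
\qquad
\mathrm{COT}_2(\mu,\nu)^2 \;=\; \int_0^1 W_2(\mu_s,\nu_s)^2\,\mathrm{d}s,
\]

where \mu_s is the conditional distribution of parameters at depth s and W_2 is the Wasserstein-2 distance. The local Polyak–Łojasiewicz condition behind the convergence result is, schematically, an inequality of the form \(\|\mathrm{grad}_{\mathrm{COT}}\,F(\mu)\|^2 \ge c\,(F(\mu)-\inf F)\) valid near the initialization, which is what turns a sufficiently small initial risk into convergence of the gradient flow to a global minimizer.

The consistency with finite-width ResNets mentioned in the abstract corresponds to discretizing depth into M Euler steps and replacing each conditional measure \mu_s by an empirical average over N features. A minimal toy sketch of this discretized forward pass, with a hypothetical residual map f(x, (w, b)) = w * relu(<b, x>) chosen only for illustration (not the paper's model):

import numpy as np

def resnet_forward(x, weights):
    """Euler discretization of dX/ds = E_{theta ~ mu_s}[f(X, theta)].
    weights[k] holds the N parameter pairs (w, b) of residual block k;
    their empirical average stands in for the conditional measure mu_s."""
    M = len(weights)                                   # number of residual blocks
    for layer in weights:
        # (1/N) * sum_j w_j * relu(<b_j, x>) approximates the mean-field update
        update = np.mean([w * max(b @ x, 0.0) for (w, b) in layer], axis=0)
        x = x + update / M                             # depth step size 1/M
    return x

# Toy usage: d-dimensional input, M blocks, N features per block.
rng = np.random.default_rng(0)
d, M, N = 3, 50, 100
weights = [[(rng.normal(size=d), rng.normal(size=d)) for _ in range(N)] for _ in range(M)]
out = resnet_forward(rng.normal(size=d), weights)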
Source journal
CiteScore: 6.70
Self-citation rate: 3.30%
Annual publication volume: 59
Review time: >12 weeks
Journal description: Communications on Pure and Applied Mathematics (ISSN 0010-3640) is published monthly, one volume per year, by John Wiley & Sons, Inc. © 2019. The journal primarily publishes papers originating at or solicited by the Courant Institute of Mathematical Sciences. It features recent developments in applied mathematics, mathematical physics, and mathematical analysis. The topics include partial differential equations, computer science, and applied mathematics. CPAM is devoted to mathematical contributions to the sciences; both theoretical and applied papers, of original or expository type, are included.