Understanding the training of infinitely deep and wide ResNets with conditional optimal transport
Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard
Communications on Pure and Applied Mathematics, published 2025-06-20. DOI: 10.1002/cpa.70004 (https://doi.org/10.1002/cpa.70004)
Abstract
We study the convergence of gradient flow for the training of deep neural networks. While residual neural networks (ResNets) are a popular example of very deep architectures, their training constitutes a challenging optimization problem, notably due to the non-convexity and non-coercivity of the objective. Yet, in applications, such tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a "mean-field" model of an infinitely deep and arbitrarily wide ResNet, parameterized by probability measures on the product set of layers and parameters, with a constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean-field models have been proven to benefit from simplified loss landscapes and good theoretical guarantees when trained with gradient flow with respect to the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow with respect to the conditional optimal transport (COT) distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces, we first show the well-posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak–Łojasiewicz analysis, we then show convergence of the gradient flow for well-chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges to a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets. In addition, this work is an opportunity to study the COT metric in more detail, particularly its dynamic formulation. Some of our results in this direction may be of independent interest.
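For orientation, here is a rough sketch, in our own notation rather than the paper's, of the kind of objects the abstract refers to. An infinitely deep, arbitrarily wide ResNet can be viewed as a depth-indexed ODE driven by a probability measure \(\mu\) on the product of layers \([0,1]\) and parameters \(\Theta\), with a fixed marginal on the layer variable:

\[
\dot{x}(s) \;=\; \int_{\Theta} \sigma\big(x(s), \theta\big)\, \mathrm{d}\mu_s(\theta), \qquad s \in [0,1],
\]

where \((\mu_s)_{s \in [0,1]}\) denotes the disintegration of \(\mu\) along the layer variable. The conditional optimal transport (COT) distance then compares two such measures \(\mu, \nu\) sharing this layer marginal by restricting Wasserstein couplings to those that do not move the layer coordinate, which reduces to a layer-wise Wasserstein distance:

\[
\mathrm{COT}_2(\mu, \nu)^2 \;=\; \int_0^1 W_2^2(\mu_s, \nu_s)\, \mathrm{d}s.
\]

This sketch is only meant to fix ideas; the precise parameterization, regularity assumptions, and the dynamic formulation of the COT metric studied in the paper may differ.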