Understanding the training of infinitely deep and wide ResNets with conditional optimal transport
Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard
Communications on Pure and Applied Mathematics, published 2025-06-20. DOI: 10.1002/cpa.70004 (https://doi.org/10.1002/cpa.70004)
Abstract
We study the convergence of gradient flow for the training of deep neural networks. While residual neural networks (ResNets) are a popular example of very deep architectures, their training constitutes a challenging optimization problem, notably due to the non-convexity and non-coercivity of the objective. Yet, in applications, such tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a "mean-field" model of an infinitely deep and arbitrarily wide ResNet, parameterized by probability measures on the product set of layers and parameters, with a constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean-field models have been proven to benefit from simplified loss landscapes and good theoretical guarantees when trained with gradient flow with respect to the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow with respect to the conditional optimal transport (COT) distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces, we first show the well-posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak–Łojasiewicz analysis, we then show convergence of the gradient flow for well-chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges to a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets. In addition, this work is an opportunity to study the COT metric in more detail, particularly its dynamic formulation. Some of our results in this direction may be of independent interest.
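For orientation, here is a rough sketch, in our own notation rather than the paper's, of the kind of objects the abstract refers to. An infinitely deep, arbitrarily wide ResNet can be viewed as a depth-indexed ODE driven by a probability measure \(\mu\) on the product of layers \([0,1]\) and parameters \(\Theta\), with a fixed marginal on the layer variable:

\[
\dot{x}(s) \;=\; \int_{\Theta} \sigma\big(x(s), \theta\big)\, \mathrm{d}\mu_s(\theta), \qquad s \in [0,1],
\]

where \((\mu_s)_{s \in [0,1]}\) denotes the disintegration of \(\mu\) along the layer variable. The conditional optimal transport (COT) distance then compares two such measures \(\mu, \nu\) sharing this layer marginal by restricting Wasserstein couplings to those that do not move the layer coordinate, which reduces to a layer-wise Wasserstein distance:

\[
\mathrm{COT}_2(\mu, \nu)^2 \;=\; \int_0^1 W_2^2(\mu_s, \nu_s)\, \mathrm{d}s.
\]

This sketch is only meant to fix ideas; the precise parameterization, regularity assumptions, and the dynamic formulation of the COT metric studied in the paper may differ.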