The Convex Landscape of Neural Networks: Characterizing Global Optima and Stationary Points via Lasso Models

IF 2.2 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Information Theory Pub Date : 2025-02-25 DOI:10.1109/TIT.2025.3545564

Tolga Ergen;Mert Pilanci

Volume 71, Issue 5, pages 3854-3870. Journal Article. https://ieeexplore.ieee.org/document/10902486/

Citations: 0

Abstract

Due to the non-convex nature of training Deep Neural Network (DNN) models, their effectiveness relies on the use of non-convex optimization heuristics. Traditional methods for training DNNs often require costly empirical procedures to produce successful models and do not have a clear theoretical foundation. In this study, we examine the use of convex optimization theory and sparse recovery models to refine the training process of neural networks and provide a better interpretation of their optimal weights. We focus on training two-layer neural networks with piecewise linear activations and demonstrate that they can be formulated as a finite-dimensional convex program. These programs include a regularization term that promotes sparsity, which constitutes a variant of group Lasso. We first utilize semi-infinite programming theory to prove strong duality for finite-width neural networks, and then we express these architectures equivalently as high-dimensional convex sparse recovery models. Remarkably, the worst-case complexity to solve the convex program is polynomial in the number of samples and number of neurons when the rank of the data matrix is bounded, which is the case in convolutional networks. To extend our method to training data of arbitrary rank, we develop a novel polynomial-time approximation scheme based on zonotope subsampling that comes with a guaranteed approximation ratio. We also show that all the stationary points of the non-convex training objective can be characterized as the global optimum of a subsampled convex program. Our convex models can be trained using standard convex solvers without resorting to heuristics or extensive hyperparameter tuning, unlike non-convex methods. Due to convexity, optimizer hyperparameters such as initialization, batch sizes, and step size schedules have no effect on the final model. Through extensive numerical experiments, we show that convex models can outperform traditional non-convex methods and are not sensitive to optimizer hyperparameters. The code for our experiments is available at https://github.com/pilancilab/convex_nn.
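
To make the "group Lasso over subsampled activation patterns" idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation and not the API of the linked repository. The function names (`sample_patterns`, `group_soft_threshold`, `fit_group_lasso`) are hypothetical, activation patterns are assumed to be drawn from random Gaussian directions, and the solver is plain proximal gradient descent swapped in for a standard convex solver. The sketch also omits the constraints that an exact reformulation of a ReLU network would place on each weight block, so it solves a simplified, unconstrained group Lasso.

```python
import numpy as np


def sample_patterns(X, num_samples, rng):
    """Sample distinct activation patterns 1[X u >= 0] from random directions u.

    Assumption: Gaussian directions stand in for the subsampling step.
    Returns an (n, P) 0/1 matrix whose columns are the sampled patterns.
    """
    U = rng.standard_normal((X.shape[1], num_samples))
    patterns = (X @ U >= 0).astype(np.int8)
    patterns = np.unique(patterns, axis=1)            # drop duplicate patterns
    patterns = patterns[:, patterns.any(axis=0)]      # drop the all-zero pattern
    return patterns.astype(float)


def group_soft_threshold(W, tau):
    """Proximal operator of tau * sum_p ||W[p]||_2 (one group per row)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return W * scale


def fit_group_lasso(X, y, D, beta, num_iters=500):
    """Minimize 0.5 * ||sum_p diag(D[:, p]) X w_p - y||^2 + beta * sum_p ||w_p||_2.

    Uses proximal gradient descent with step size 1/L, where L is the squared
    spectral norm of the stacked design matrix. Returns the weights W (P, d),
    the design matrix A (n, P*d), and the objective value at every iterate.
    """
    n, d = X.shape
    P = D.shape[1]
    # Block p of A is X with its rows masked by pattern p.
    A = (D[:, :, None] * X[:, None, :]).reshape(n, P * d)
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    W = np.zeros((P, d))
    history = []
    for _ in range(num_iters):
        residual = A @ W.ravel() - y
        history.append(0.5 * residual @ residual
                       + beta * np.linalg.norm(W, axis=1).sum())
        grad = (A.T @ residual).reshape(P, d)
        W = group_soft_threshold(W - step * grad, step * beta)
    return W, A, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 4))
    y = np.maximum(X @ rng.standard_normal(4), 0.0)   # synthetic ReLU target
    D = sample_patterns(X, num_samples=30, rng=rng)
    W, _, history = fit_group_lasso(X, y, D, beta=0.5)
    active = int((np.linalg.norm(W, axis=1) > 0).sum())
    print(f"patterns: {D.shape[1]}, active groups: {active}, "
          f"objective: {history[-1]:.4f}")
```

Each row of `W` plays the role of one candidate neuron. The group penalty zeroes out whole rows, which is the sparsity-promoting behavior the abstract attributes to the regularizer. Because the objective is convex, the minimum value reached does not depend on the starting point, although this sketch always starts from zero.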


Source journal

IEEE Transactions on Information Theory (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 5.70

Self-citation rate: 20.00%

Articles published: 514

Review time: 12 months

Journal description: The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.