{"title":"mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization.","authors":"Yue Niu, Zalan Fabian, Sunwoo Lee, Mahdi Soltanolkotabi, Salman Avestimehr","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Quasi-Newton methods still face significant challenges in training large-scale neural networks due to additional compute costs in the Hessian related computations and instability issues in stochastic training. A well-known method, L-BFGS that efficiently approximates the Hessian using history parameter and gradient changes, suffers convergence instability in stochastic training. So far, attempts that adapt L-BFGS to large-scale stochastic training incur considerable extra overhead, which offsets its convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization. mL-BFGS introduces a nearly cost-free momentum scheme into L-BFGS update and greatly reduces stochastic noise in the Hessian, therefore stabilizing convergence during stochastic optimization. For model training at a large scale, mL-BFGS approximates a block-wise Hessian, thus enabling distributing compute and memory costs across all computing nodes. We provide a supporting convergence analysis for mL-BFGS in stochastic settings. To investigate mL-BFGS's potential in large-scale DNN training, we train benchmark neural models using mL-BFGS and compare performance with baselines (SGD, Adam, and other quasi-Newton methods). Results show that mL-BFGS achieves both noticeable iteration-wise and wall-clock speedup.</p>","PeriodicalId":75238,"journal":{"name":"Transactions on machine learning research","volume":"2023 ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12393816/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Transactions on machine learning research","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Quasi-Newton methods still face significant challenges in training large-scale neural networks due to the additional compute cost of Hessian-related computations and instability issues in stochastic training. L-BFGS, a well-known method that efficiently approximates the Hessian using historical parameter and gradient changes, suffers from convergence instability in stochastic training. So far, attempts to adapt L-BFGS to large-scale stochastic training have incurred considerable extra overhead, which offsets its convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization. mL-BFGS introduces a nearly cost-free momentum scheme into the L-BFGS update that greatly reduces stochastic noise in the Hessian approximation, thereby stabilizing convergence during stochastic optimization. For model training at large scale, mL-BFGS approximates a block-wise Hessian, enabling compute and memory costs to be distributed across all computing nodes. We provide a supporting convergence analysis for mL-BFGS in stochastic settings. To investigate the potential of mL-BFGS in large-scale DNN training, we train benchmark neural models with mL-BFGS and compare its performance against baselines (SGD, Adam, and other quasi-Newton methods). Results show that mL-BFGS achieves noticeable speedups both per iteration and in wall-clock time.
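To make the core idea concrete, the sketch below (NumPy) illustrates one way a momentum scheme can be folded into a standard L-BFGS two-loop recursion: the curvature pairs (s, y) are built from exponential moving averages of parameter and gradient changes, which damps stochastic noise before they enter the Hessian approximation. This is an illustrative sketch of the general idea only, not the authors' exact mL-BFGS update; the function names, hyperparameters (beta, history size m, learning rate), and the omission of the block-wise/distributed Hessian are assumptions made for brevity.

```python
# Minimal sketch of a momentum-smoothed L-BFGS update (illustrative only).
# Assumption: curvature pairs are EMAs of parameter/gradient changes; the
# block-wise, distributed Hessian of the paper is not modeled here.
import numpy as np
from collections import deque

def two_loop_direction(grad, history):
    """Standard L-BFGS two-loop recursion: returns an approximation of H^{-1} @ grad."""
    q = grad.copy()
    alphas = []
    for s, y, rho in reversed(history):          # newest pair first
        a = rho * s.dot(q)
        alphas.append(a)
        q -= a * y
    if history:
        s, y, _ = history[-1]
        q *= s.dot(y) / y.dot(y)                 # initial Hessian scaling
    for (s, y, rho), a in zip(history, reversed(alphas)):  # oldest pair first
        b = rho * y.dot(q)
        q += (a - b) * s
    return q

def momentum_lbfgs_sketch(grad_fn, w, steps=100, lr=0.1, m=10, beta=0.9):
    """Gradient-based loop where curvature pairs are momentum-smoothed (assumed hyperparameters)."""
    history = deque(maxlen=m)            # stored (s, y, rho) curvature pairs
    s_mom = np.zeros_like(w)             # momentum of parameter changes
    y_mom = np.zeros_like(w)             # momentum of gradient changes
    g_prev, w_prev = grad_fn(w), w.copy()
    for _ in range(steps):
        d = two_loop_direction(g_prev, list(history))
        w = w - lr * d
        g = grad_fn(w)
        # Momentum-smoothed curvature pair: damps stochastic noise in (s, y).
        s_mom = beta * s_mom + (1 - beta) * (w - w_prev)
        y_mom = beta * y_mom + (1 - beta) * (g - g_prev)
        if s_mom.dot(y_mom) > 1e-10:     # keep only pairs with positive curvature s^T y > 0
            history.append((s_mom.copy(), y_mom.copy(), 1.0 / s_mom.dot(y_mom)))
        w_prev, g_prev = w.copy(), g
    return w

# Usage on a toy quadratic: minimize 0.5 * w^T A w - b^T w.
A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 2.0, 3.0])
w_star = momentum_lbfgs_sketch(lambda w: A @ w - b, np.zeros(3))
```

On a deterministic quadratic the smoothing is exact (any average of parameter changes maps to the same average of gradient changes), so the sketch mainly shows where the momentum enters; in stochastic training the same smoothing is what reduces noise in the curvature estimates.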