SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade
{"title":"SOAP: Improving and Stabilizing Shampoo using Adam","authors":"Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade","doi":"arxiv-2409.11321","DOIUrl":null,"url":null,"abstract":"There is growing evidence of the effectiveness of Shampoo, a higher-order\npreconditioning method, over Adam in deep learning optimization tasks. However,\nShampoo's drawbacks include additional hyperparameters and computational\noverhead when compared to Adam, which only updates running averages of first-\nand second-moment quantities. This work establishes a formal connection between\nShampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient\napproximation of Adam -- showing that Shampoo is equivalent to running\nAdafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to\nthe design of a simpler and computationally efficient algorithm:\n$\\textbf{S}$hampo$\\textbf{O}$ with $\\textbf{A}$dam in the\n$\\textbf{P}$reconditioner's eigenbasis (SOAP). With regards to improving Shampoo's computational efficiency, the most\nstraightforward approach would be to simply compute Shampoo's\neigendecomposition less frequently. Unfortunately, as our empirical results\nshow, this leads to performance degradation that worsens with this frequency.\nSOAP mitigates this degradation by continually updating the running average of\nthe second moment, just as Adam does, but in the current (slowly changing)\ncoordinate basis. Furthermore, since SOAP is equivalent to running Adam in a\nrotated space, it introduces only one additional hyperparameter (the\npreconditioning frequency) compared to Adam. We empirically evaluate SOAP on\nlanguage model pre-training with 360m and 660m sized models. In the large batch\nregime, SOAP reduces the number of iterations by over 40% and wall clock time\nby over 35% compared to AdamW, with approximately 20% improvements in both\nmetrics compared to Shampoo. An implementation of SOAP is available at\nhttps://github.com/nikhilvyas/SOAP.","PeriodicalId":501301,"journal":{"name":"arXiv - CS - Machine Learning","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11321","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). With regard to improving Shampoo's computational efficiency, the most straightforward approach would be to simply compute Shampoo's eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens as the eigendecomposition is computed less often. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with 360M- and 660M-parameter models. In the large-batch regime, SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.
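To make the idea in the abstract concrete, the following NumPy sketch shows what a SOAP-style update for a single matrix-shaped parameter could look like: Shampoo's Kronecker factors supply a slowly refreshed eigenbasis, and a plain Adam update runs in that rotated space. The function name `soap_step`, the `state` dictionary, and the default hyperparameters are assumptions made for this illustration, and it omits details of the real optimizer (bias correction, handling of non-matrix parameters, and how the Adam moments are carried across basis refreshes); see https://github.com/nikhilvyas/SOAP for the reference implementation.

```python
import numpy as np

def soap_step(W, G, state, lr=3e-3, b1=0.95, b2=0.95, eps=1e-8, precond_freq=10):
    """Illustrative SOAP-style update for one 2-D parameter W with gradient G.

    A minimal sketch of "Adam in the eigenbasis of Shampoo's preconditioner",
    not the official implementation.
    """
    m, n = W.shape
    if not state:  # lazy initialization of optimizer state
        state.update(L=np.zeros((m, m)), R=np.zeros((n, n)),
                     QL=np.eye(m), QR=np.eye(n),
                     M=np.zeros_like(W), V=np.zeros_like(W), t=0)
    state["t"] += 1

    # Shampoo's two Kronecker factors, kept as exponential moving averages.
    state["L"] = b2 * state["L"] + (1 - b2) * (G @ G.T)
    state["R"] = b2 * state["R"] + (1 - b2) * (G.T @ G)

    # Refresh the eigenbasis only every `precond_freq` steps -- the one extra
    # hyperparameter relative to Adam mentioned in the abstract. (The real
    # optimizer also transports the Adam moments across this refresh; that is
    # omitted here for brevity.)
    if state["t"] % precond_freq == 1:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    QL, QR = state["QL"], state["QR"]

    # Rotate the gradient into the (slowly changing) eigenbasis and run a
    # standard Adam-style moment update there.
    Gr = QL.T @ G @ QR
    state["M"] = b1 * state["M"] + (1 - b1) * Gr
    state["V"] = b2 * state["V"] + (1 - b2) * Gr**2
    update_rot = state["M"] / (np.sqrt(state["V"]) + eps)

    # Rotate the update back to the original coordinates and apply it.
    return W - lr * (QL @ update_rot @ QR.T)
```

The expensive eigendecompositions are amortized over `precond_freq` steps, while the cheap per-step moment updates continue in the current basis, which is how the abstract describes SOAP avoiding the degradation seen when Shampoo's eigendecomposition is simply computed less often.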