1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC). Pub Date: 2021-04-13. DOI: 10.1109/HiPC56025.2022.00044

Conglong Li, A. Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He


Citations: 16

Abstract


1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed

When training large machine learning models (like BERT and GPT-3) on hundreds of GPUs, communication has become a significant bottleneck, especially on commodity systems with limited-bandwidth TCP networks. On one side, large-batch optimization algorithms such as LAMB were proposed to reduce the frequency of communication. On the other side, communication compression algorithms such as 1-bit Adam help to reduce the volume of each communication. However, we find that using only one of these techniques is not sufficient to solve the communication challenge, especially under low network bandwidth. Motivated by this, we aim to combine the power of large-batch optimization and communication compression, but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication-efficient optimization algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates under compression. In addition to the algorithm and the corresponding theoretical analysis, we propose three novel system implementations to achieve actual wall-clock speedup: a momentum fusion mechanism to reduce the number of communications, a momentum scaling technique to reduce compression error, and an NCCL-based compressed communication backend to improve both usability and performance. For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that our optimized implementation of 1-bit LAMB achieves up to 4.6x communication volume reduction, up to 2.8x end-to-end time-wise speedup, and the same sample-wise convergence speed (and the same fine-tuning task accuracy) compared to uncompressed LAMB. Furthermore, 1-bit LAMB achieves the same accuracy as LAMB on computer vision tasks like ImageNet and CIFAR100.
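The two ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `one_bit_compress` and `lamb_trust_ratio` are hypothetical, the mean-absolute-value scale is an assumption chosen for simplicity, and the trust ratio omits any clipping function. The first function shows sign-based 1-bit compression with error feedback, where the compression error is carried into the next step. The second shows the kind of per-layer scaling factor (weight norm over update norm) that LAMB computes, which is what a compression scheme has to keep meaningful.

```python
import math


def one_bit_compress(values, error):
    """Compress a vector to one sign per element plus a single shared scale.

    `error` is the compression error carried over from the previous call
    (error feedback). Returns (signs, scale, new_error).
    """
    # Add back what the previous compression step lost.
    corrected = [v + e for v, e in zip(values, error)]
    # Assumption: use the mean absolute value as the shared magnitude.
    scale = sum(abs(c) for c in corrected) / len(corrected)
    signs = [1 if c >= 0 else -1 for c in corrected]
    # What a receiver would reconstruct from the signs and the scale.
    decompressed = [s * scale for s in signs]
    # Remember the part that was lost so the next step can compensate.
    new_error = [c - d for c, d in zip(corrected, decompressed)]
    return signs, scale, new_error


def lamb_trust_ratio(weights, update):
    """Per-layer ratio ||weights|| / ||update||, or 1.0 if either norm is zero."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    u_norm = math.sqrt(sum(u * u for u in update))
    if w_norm > 0 and u_norm > 0:
        return w_norm / u_norm
    return 1.0


if __name__ == "__main__":
    momentum = [0.5, -0.25, 0.75, -1.0]
    signs, scale, error = one_bit_compress(momentum, [0.0] * len(momentum))
    print(signs, scale, error)
    print(lamb_trust_ratio([3.0, 4.0], [0.0, 0.5]))
```

Because the ratio depends on the norm of each layer's update, replacing that update with a sign vector and one scale changes the ratio itself, which is one way to see why the abstract says existing compression strategies cannot be applied to LAMB directly.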


Source journal

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)

Self-citation rate

0.00%

Publication count