1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed
Conglong Li, A. Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He
2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC). DOI: 10.1109/HiPC56025.2022.00044. Citations: 16.
Abstract
To train large machine learning models (such as BERT and GPT-3) on hundreds of GPUs, communication has become a significant bottleneck, especially on commodity systems with limited-bandwidth TCP networks. On one hand, large-batch optimizers such as the LAMB algorithm were proposed to reduce the frequency of communication. On the other hand, communication compression algorithms such as 1-bit Adam help to reduce the volume of each communication. However, we find that using either technique alone is not sufficient to solve the communication challenge, especially under low network bandwidth. Motivated by this, we aim to combine the power of large-batch optimization and communication compression, but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication-efficient optimization algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates under compression. In addition to the algorithm and the corresponding theoretical analysis, we propose three novel system implementations to achieve actual wall-clock speedup: a momentum fusion mechanism to reduce the number of communications, a momentum scaling technique to reduce compression error, and an NCCL-based compressed communication backend to improve both usability and performance. For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that our optimized implementation of 1-bit LAMB achieves up to 4.6x communication volume reduction, up to 2.8x end-to-end time-wise speedup, and the same sample-wise convergence speed (and the same fine-tuning task accuracy) as uncompressed LAMB. Furthermore, 1-bit LAMB matches LAMB's accuracy on computer vision tasks such as ImageNet and CIFAR-100.
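For intuition, the following is a minimal NumPy sketch of the two ingredients the abstract combines: sign-based 1-bit compression with error feedback (the mechanism behind 1-bit Adam-style methods) and LAMB's layerwise trust ratio. The function names and structure are our own illustration, not the paper's implementation; the paper's point is precisely that composing these two pieces naively breaks down, which is what 1-bit LAMB's new mechanism for layerwise learning rates under compression addresses.

import numpy as np

def one_bit_compress(tensor, residual):
    # Error feedback: fold in the quantization error carried over from the
    # previous step before compressing.
    corrected = tensor + residual
    # 1-bit quantization: one sign bit per element plus a single per-tensor
    # scale, shrinking communicated volume roughly 32x vs. fp32.
    scale = np.abs(corrected).mean()
    quantized = scale * np.sign(corrected)
    # The fresh quantization error stays on this worker and is fed back
    # into the next step.
    return quantized, corrected - quantized

def lamb_layer_update(weights, update, lr, eps=1e-6):
    # LAMB's adaptive layerwise learning rate: rescale each layer's update
    # by the trust ratio ||w|| / ||update||.
    trust_ratio = np.linalg.norm(weights) / (np.linalg.norm(update) + eps)
    return weights - lr * trust_ratio * update

In a real distributed run, the quantized momentum would be exchanged across workers (e.g. through the NCCL-based compressed backend the paper describes) while each worker keeps its residual locally; keeping the layerwise trust ratios well-behaved once the update is built from compressed momentum is the problem 1-bit LAMB is designed to solve.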