Accelerating Large Language Model Training with Hybrid GPU-based Compression
Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha R. Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
arXiv:2409.02423 (4 Sep 2024)
Abstract
Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP)
are the three strategies widely adopted to enable fast and efficient Large
Language Model (LLM) training. However, these approaches rely on data-intensive
communication routines to collect, aggregate, and redistribute gradients,
activations, and other important model information, incurring significant
overhead. MPI libraries co-designed with GPU-based compression libraries
have been shown to significantly reduce message sizes and better utilize
interconnect bandwidth, thereby increasing training efficiency while
maintaining acceptable accuracy. In this work, we investigate the efficacy of
compression-assisted MPI collectives in the context of distributed LLM
training using 3D parallelism
and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen
supercomputer. First, we enabled a naïve compression scheme across all
collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6%
increase in samples per second for GPT-NeoX-20B training. Nonetheless, such a
strategy ignores the sparsity discrepancy among messages communicated in each
parallelism degree, thus introducing more errors and causing degradation in
training loss. Therefore, we applied hybrid compression settings to each
parallel dimension and adjusted the compression intensity accordingly.
Given their low-rank structure (arXiv:2301.02654), we apply aggressive
compression to gradients during the DP All-reduce. We adopt milder
compression to preserve precision while communicating activations, optimizer
states, and model parameters in TP and PP. Using the adjusted hybrid
compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a
12.7% increase in samples per second while reaching baseline loss convergence.
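
To make the hybrid idea concrete, below is a minimal sketch assuming a PyTorch-style setup with separate process groups per parallel dimension. It is not the paper's implementation, which co-designs GPU-based compression inside the MPI library; here "aggressive" versus "mild" compression is only emulated with precision casts, and the per-dimension settings and helper names are illustrative assumptions.

```python
# Illustrative sketch only: per-dimension compression intensity applied
# around a torch.distributed all-reduce. A real GPU-based compression
# library would pack low-bit blocks inside the MPI collective instead.
import torch
import torch.distributed as dist

# Hypothetical intensity per parallel dimension: aggressive for DP gradient
# all-reduce (gradients tolerate more error), mild for TP/PP traffic
# (activations, optimizer states, parameters need more precision).
COMPRESSION = {"dp": "aggressive", "tp": "mild", "pp": "mild"}

def hybrid_allreduce(tensor: torch.Tensor, group, dim: str) -> torch.Tensor:
    """Sum-all-reduce `tensor` over `group`, compressing according to the
    parallel dimension `dim` before communication."""
    if COMPRESSION[dim] == "aggressive":
        # Agree on one scale across ranks, then communicate in fp16; this
        # stands in for low-bit block quantization in a compression library.
        scale = tensor.abs().max().reshape(1).clamp(min=1e-8)
        dist.all_reduce(scale, op=dist.ReduceOp.MAX, group=group)
        buf = (tensor / scale).to(torch.float16)
        dist.all_reduce(buf, group=group)
        tensor.copy_(buf.to(tensor.dtype) * scale)
    else:
        # Mild compression: fp16 on the wire, original dtype restored.
        buf = tensor.to(torch.float16)
        dist.all_reduce(buf, group=group)
        tensor.copy_(buf.to(tensor.dtype))
    return tensor
```

In a real 3D-parallel run, `group` would be the DP, TP, or PP communicator created by the training framework, and the intensity table would map to the compression levels exposed by the underlying GPU-aware MPI library.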