LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang
arXiv:2409.00918 (arXiv - CS - Distributed, Parallel, and Cluster Computing), published 2024-09-02
Abstract
The recent progress in large language models (LLMs) has brought tremendous application prospects to the world. Growing model sizes demand LLM training on multiple GPUs, and data parallelism is the most popular distributed training strategy due to its simplicity, efficiency, and scalability. Current systems adopt model-sharded data parallelism to enable memory-efficient training. However, existing model-sharded data-parallel systems fail to efficiently utilize GPUs on a commodity GPU cluster with 100 Gbps (or 200 Gbps) inter-GPU bandwidth, due to 1) severe interference between collective operations and GPU computation and 2) heavy CPU optimizer overhead. Recent works propose in-network aggregation (INA) to relieve the network bandwidth pressure in data-parallel training, but they are incompatible with model sharding due to their network design. To this end, we propose LuWu, a novel in-network optimizer that enables efficient model-in-network data-parallel training of 100B-scale models on distributed GPUs. This new data-parallel paradigm keeps a communication pattern similar to that of model-sharded data parallelism, but executes the optimizer centrally in the network. The key idea is to offload the entire set of optimizer states and parameters from the GPU workers onto an in-network optimizer node, and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. Experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training a 175B model on an 8-worker cluster.
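To make the described data flow concrete, below is a minimal conceptual sketch (not the LuWu implementation) of the "push gradients, pull parameters" pattern the abstract outlines: workers hold no optimizer state, while a centralized optimizer node aggregates gradients and runs the update. All names (InNetworkOptimizer, aggregate_and_step) and the choice of SGD with momentum are hypothetical illustrations; the real system performs the aggregation on SmartNIC/SmartSwitch hardware rather than in host Python.

```python
# Conceptual sketch only: illustrates a centralized in-network optimizer flow,
# where GPU workers ship gradients out and pull fresh parameters back.
import numpy as np


class InNetworkOptimizer:
    """Stand-in for the optimizer node that holds ALL parameters and
    optimizer states, so the GPU workers keep neither (hypothetical API)."""

    def __init__(self, num_params: int, lr: float = 1e-3, momentum: float = 0.9):
        self.params = np.zeros(num_params, dtype=np.float32)
        self.velocity = np.zeros(num_params, dtype=np.float32)  # optimizer state
        self.lr = lr
        self.momentum = momentum

    def aggregate_and_step(self, worker_grads: list) -> np.ndarray:
        # In-network aggregation: reduce the gradients from all workers ...
        grad = np.mean(np.stack(worker_grads, axis=0), axis=0)
        # ... then run the optimizer centrally (SGD with momentum, purely
        # for illustration; the actual optimizer in LuWu may differ).
        self.velocity = self.momentum * self.velocity + grad
        self.params -= self.lr * self.velocity
        # Workers pull the freshly updated parameters for the next iteration.
        return self.params


# One training "iteration" with 8 hypothetical workers, each producing a
# local gradient over the same (tiny) parameter vector.
opt = InNetworkOptimizer(num_params=4)
grads = [np.random.randn(4).astype(np.float32) for _ in range(8)]
new_params = opt.aggregate_and_step(grads)
print(new_params)
```

The point of the sketch is the division of labor: because parameters and optimizer states live only on the central node, the workers' memory and CPU optimizer overhead disappear, at the cost of making gradient aggregation and parameter distribution a network problem, which is what the SmartNIC-SmartSwitch co-optimization addresses.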