LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang
arXiv:2409.00918 (arXiv - CS - Distributed, Parallel, and Cluster Computing), published 2024-09-02
Abstract
The recent progress in large language models (LLMs) has brought tremendous application prospects to the world. Growing model sizes demand LLM training on multiple GPUs, and data parallelism is the most popular distributed training strategy due to its simplicity, efficiency, and scalability. Current systems adopt model-sharded data parallelism to enable memory-efficient training. However, existing model-sharded data-parallel systems fail to efficiently utilize GPUs on a commodity GPU cluster with 100 Gbps (or 200 Gbps) inter-GPU bandwidth, due to 1) severe interference between collective operations and GPU computation and 2) heavy CPU optimizer overhead. Recent works propose in-network aggregation (INA) to relieve the network bandwidth pressure in data-parallel training, but they are incompatible with model sharding due to their network design. To this end, we propose LuWu, a novel in-network optimizer that enables efficient model-in-network data-parallel training of 100B-scale models on distributed GPUs. This new data-parallel paradigm keeps a communication pattern similar to that of model-sharded data parallelism, but executes the optimizer centrally in the network. The key idea is to offload the entire set of optimizer states and parameters from the GPU workers onto an in-network optimizer node, and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. Experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training a 175B model on an 8-worker cluster.
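To make the described data flow concrete, below is a minimal conceptual sketch (not the LuWu implementation) of the "push gradients, pull parameters" pattern the abstract outlines: workers hold no optimizer state, while a centralized optimizer node aggregates gradients and runs the update. All names (InNetworkOptimizer, aggregate_and_step) and the choice of SGD with momentum are hypothetical illustrations; the real system performs the aggregation on SmartNIC/SmartSwitch hardware rather than in host Python.

```python
# Conceptual sketch only: illustrates a centralized in-network optimizer flow,
# where GPU workers ship gradients out and pull fresh parameters back.
import numpy as np


class InNetworkOptimizer:
    """Stand-in for the optimizer node that holds ALL parameters and
    optimizer states, so the GPU workers keep neither (hypothetical API)."""

    def __init__(self, num_params: int, lr: float = 1e-3, momentum: float = 0.9):
        self.params = np.zeros(num_params, dtype=np.float32)
        self.velocity = np.zeros(num_params, dtype=np.float32)  # optimizer state
        self.lr = lr
        self.momentum = momentum

    def aggregate_and_step(self, worker_grads: list) -> np.ndarray:
        # In-network aggregation: reduce the gradients from all workers ...
        grad = np.mean(np.stack(worker_grads, axis=0), axis=0)
        # ... then run the optimizer centrally (SGD with momentum, purely
        # for illustration; the actual optimizer in LuWu may differ).
        self.velocity = self.momentum * self.velocity + grad
        self.params -= self.lr * self.velocity
        # Workers pull the freshly updated parameters for the next iteration.
        return self.params


# One training "iteration" with 8 hypothetical workers, each producing a
# local gradient over the same (tiny) parameter vector.
opt = InNetworkOptimizer(num_params=4)
grads = [np.random.randn(4).astype(np.float32) for _ in range(8)]
new_params = opt.aggregate_and_step(grads)
print(new_params)
```

The point of the sketch is the division of labor: because parameters and optimizer states live only on the central node, the workers' memory and CPU optimizer overhead disappear, at the cost of making gradient aggregation and parameter distribution a network problem, which is what the SmartNIC-SmartSwitch co-optimization addresses.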