{"title":"Accelerating Distributed Training of Large Concurrent-Branch Models Through Bidirectional Pipeline Coordination","authors":"Zan Zong;Yuyang Chen;Qi Zhang;Daming Zhao;Jianjiang Li;Yijun Jing;Jidong Zhai","doi":"10.26599/TST.2024.9010233","DOIUrl":null,"url":null,"abstract":"Large models have been widely used in the field of neural language processing, information retrieving, etc. With the development of the large models, not only is the parameter scale increased, but the model architecture has also become more complex. For example, the multi-modal transformer-based model mainly has concurrent branches, which we denoted as the concurrent branch model (CBM). Many CBMs have enlarged to tens of billions of parameters, and require distributed resources to train this kind of model. Existing distributed training systems cannot fully handle this type of model architecture because there are interactions between branches. Inspired by the unbalanced resource usage of pipeline parallelism, we prefer to organize different branches with a fine-grained bidirectional pipeline schedule of communication and computation. However, improper coordination between branches leads to idle time for computation and low training efficiency. In this paper, we present Flexpipe, a pipeline engine for c3oncurrent-branch models. We first introduce a branch-aware pipeline parallelism (BAPP) to make full use of the concurrent characteristic of the model architecture. Then, based on a multi-branch pipeline simulator, we propose an adaptive interaction coordinator, which facilitates the low-overhead branch interactions during the distributed model training. We evaluate our approach on popular concurrent branch models combined with modern training systems. Compared with the Chimera, the experiential results show that our method improves the end-to-end training throughput by 20% on average.","PeriodicalId":48690,"journal":{"name":"Tsinghua Science and Technology","volume":"30 6","pages":"2638-2652"},"PeriodicalIF":3.5000,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11072115","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Tsinghua Science and Technology","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11072115/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Multidisciplinary","Score":null,"Total":0}
Abstract
Large models have been widely used in fields such as natural language processing and information retrieval. As large models have developed, not only has the parameter scale increased, but the model architecture has also become more complex. For example, multi-modal transformer-based models typically contain concurrent branches, which we denote as concurrent-branch models (CBMs). Many CBMs have grown to tens of billions of parameters and require distributed resources to train. Existing distributed training systems cannot fully handle this type of model architecture because of the interactions between branches. Motivated by the unbalanced resource usage of pipeline parallelism, we organize the different branches with a fine-grained bidirectional pipeline schedule of communication and computation. However, improper coordination between branches leads to idle computation time and low training efficiency. In this paper, we present Flexpipe, a pipeline engine for concurrent-branch models. We first introduce branch-aware pipeline parallelism (BAPP) to make full use of the concurrent characteristics of the model architecture. Then, based on a multi-branch pipeline simulator, we propose an adaptive interaction coordinator, which enables low-overhead branch interactions during distributed model training. We evaluate our approach on popular concurrent-branch models combined with modern training systems. The experimental results show that, compared with Chimera, our method improves end-to-end training throughput by 20% on average.
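To make the scheduling idea concrete, below is a minimal toy sketch — not the authors' Flexpipe implementation — of how a branch-aware bidirectional schedule and a simulator-guided coordinator might fit together: two branch pipelines share the same devices, an interaction stage consumes activations from both branches, and a small discrete-time simulator searches candidate schedules (branch direction and injection offset) for the one with the least idle time. All names and parameters (NUM_DEVICES, INTERACT_STAGE, simulate, coordinate, etc.) are illustrative assumptions, not the paper's API.

```python
# Toy model: two branches, A and B, each a chain of NUM_DEVICES pipeline
# stages; one stage occupies one device for one tick. Stage INTERACT_STAGE
# of either branch also needs the peer branch's activations from the
# preceding stage of the same micro-batch (a simplified branch interaction).

NUM_DEVICES = 4       # pipeline depth; one stage of each branch per device
NUM_MICROBATCHES = 4  # micro-batches injected into each branch
INTERACT_STAGE = 2    # hypothetical interaction point between the branches

def simulate(dir_b, offset_b):
    """Greedily simulate both branch pipelines sharing the devices.

    Branch A always maps stage s to device s. Branch B maps stage s to
    device s (dir_b=+1) or to device NUM_DEVICES-1-s (dir_b=-1, i.e., a
    bidirectional layout). offset_b delays branch B's first micro-batch.
    Returns (makespan, total_idle_device_ticks).
    """
    def device_of(branch, s):
        if branch == "A" or dir_b == +1:
            return s
        return NUM_DEVICES - 1 - s

    done = {}  # (branch, microbatch, stage) -> finish tick
    pending = {(b, m, s) for b in "AB"
               for m in range(NUM_MICROBATCHES)
               for s in range(NUM_DEVICES)}

    def ready(task, t):
        b, m, s = task
        if b == "B" and s == 0 and t < offset_b:
            return False                       # branch B not injected yet
        if s > 0 and done.get((b, m, s - 1), t + 1) > t:
            return False                       # own previous stage unfinished
        if s == INTERACT_STAGE:                # wait for the peer branch too
            peer = "B" if b == "A" else "A"
            if done.get((peer, m, s - 1), t + 1) > t:
                return False
        return True

    t = idle = 0
    while pending:
        busy = set()
        for task in sorted(pending):           # deterministic greedy order
            dev = device_of(task[0], task[2])
            if dev not in busy and ready(task, t):
                busy.add(dev)
                done[task] = t + 1
                pending.discard(task)
        idle += NUM_DEVICES - len(busy)
        t += 1
    return t, idle

def coordinate():
    """Stand-in for the simulator-guided coordinator: exhaustively try a
    small space of candidate schedules and keep the least-idle one."""
    candidates = [(dir_b, off) for dir_b in (+1, -1)
                  for off in range(NUM_DEVICES + 1)]
    return min(candidates, key=lambda c: simulate(*c))

if __name__ == "__main__":
    for cand in [(+1, 0), (-1, 0)]:
        makespan, idle = simulate(*cand)
        print(f"dir_b={cand[0]:+d} offset={cand[1]}: "
              f"makespan={makespan} idle={idle}")
    best = coordinate()
    print("coordinator picks:", best, "->", simulate(*best))
```

In this toy, reversing branch B's device order lets it enter the pipeline from the opposite end instead of contending with branch A for the first device, which is the intuition behind the bidirectional schedule; the real system replaces this exhaustive search with an adaptive coordinator over a much richer schedule space.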
Journal introduction:
Tsinghua Science and Technology (Tsinghua Sci Technol) began publication in 1996. It is an international academic journal sponsored by Tsinghua University and published bimonthly. The journal presents up-to-date scientific achievements in computer science, electronic engineering, and other IT fields. Contributions from all over the world are welcome.