MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism

IF 5.6 | Zone 2, Computer Science | Q1, Computer Science, Theory & Methods

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-04-08 DOI:10.1109/TPDS.2024.3385639

Zheng Zhang;Yaqi Xia;Hulin Wang;Donglin Yang;Chuang Hu;Xiaobo Zhou;Dazhao Cheng

Journal Article. IEEE Transactions on Parallel and Distributed Systems, volume 35 (6), pages 843-856. Publication date: 2024-04-08. https://ieeexplore.ieee.org/document/10494556/

Citations: 0

Abstract

In recent years, the Mixture-of-Experts (MoE) technique has gained widespread popularity as a means to scale pre-trained models to exceptionally large sizes. Dynamic activation of experts allows for conditional computation, which increases the number of parameters of a neural network and is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite existing system and algorithm optimizations, significant challenges remain in communication efficiency and memory consumption. In this paper, we present the design and implementation of MPMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Inspired by the observation that the MoE training procedure can be divided into multiple independent sub-stages, we design a pipeline parallelism method that reduces communication latency by overlapping communication with computation. Further, we analyze the memory footprint breakdown of MoE training and identify activations and temporary buffers as the primary contributors to the overall memory footprint. Toward memory efficiency, we propose memory reuse strategies that reduce memory requirements by eliminating memory redundancies. Finally, to optimize pipeline granularity and memory reuse strategies jointly, we propose a profile-based algorithm and a performance model that determine the configuration of MPMoE at runtime. We implement MPMoE on top of PyTorch and evaluate it with common MoE models on two physical clusters, one with 64 NVIDIA A100 GPUs and one with 16 NVIDIA V100 GPUs. Compared with the state-of-the-art approach, MPMoE achieves up to 2.3× speedup while reducing the memory footprint by more than 30% when training large models.
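To make the trade-off behind "pipeline granularity" concrete, the following is a minimal sketch of how a profile-based cost model might pick the number of chunks a batch is split into. It is not the MPMoE algorithm: the two-stage model (communication followed by computation), the `LayerProfile` fields, the fixed per-chunk overhead, and the candidate list are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerProfile:
    """Hypothetical profiled timings for one MoE layer (milliseconds)."""
    comm_ms: float      # all-to-all time for the whole batch (assumed)
    compute_ms: float   # expert computation time for the whole batch (assumed)
    overhead_ms: float  # fixed launch cost paid per chunk in each stage (assumed)


def pipelined_time(p: LayerProfile, n: int) -> float:
    """Estimated time when the batch is split into n equal chunks.

    Two-stage pipeline: the first chunk passes through both stages,
    then each remaining chunk adds the time of the slower stage.
    """
    comm = p.comm_ms / n + p.overhead_ms
    comp = p.compute_ms / n + p.overhead_ms
    return comm + comp + (n - 1) * max(comm, comp)


def best_granularity(p: LayerProfile, candidates=(1, 2, 4, 8, 16)) -> int:
    """Return the candidate chunk count with the lowest estimated time."""
    return min(candidates, key=lambda n: pipelined_time(p, n))


profile = LayerProfile(comm_ms=8.0, compute_ms=8.0, overhead_ms=0.5)
for n in (1, 2, 4, 8, 16):
    print(n, pipelined_time(profile, n))
print("chosen granularity:", best_granularity(profile))
```

In this toy model, finer chunks hide more communication behind computation, but every extra chunk pays the fixed overhead again, so the estimated time has a minimum at an intermediate chunk count. A real system would also have to account for the memory cost of each configuration, which this sketch leaves out.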

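Source journal

IEEE Transactions on Parallel and Distributed Systems (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 11.00

Self-citation rate: 9.40%

Articles published: 281

Review time: 5.6 months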

Journal description: IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to:

a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms; scalability of algorithms and data structures for parallel and distributed systems; communication and synchronization protocols; network algorithms; scheduling; and load balancing.

b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems.

c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation.

d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.