Guangyao Zhou;Yiqin Fu;Haocheng Lan;Yuanlun Xie;Wenhong Tian;Rajkumar Buyya;Jianhong Qian;Teng Su
Title: Cross-Search With Improved Multi-Dimensional Dichotomy-Based Joint Optimization for Distributed Parallel Training of DNN
Journal: IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 8, pp. 1680-1694
Publication date: 2025-06-16
DOI: 10.1109/TPDS.2025.3580098
URL: https://ieeexplore.ieee.org/document/11037501/
Impact factor: 5.6 (JCR Q1, Computer Science, Theory & Methods)
Abstract
Distributed parallel training of large-scale deep neural networks (DNNs) has attracted attention from both the artificial intelligence and high-performance distributed computing communities. One efficient approach is micro-batch-based pipeline parallelism (MBPP), e.g., GPipe and TeraPipe. Based on MBPP, we establish a time-cost model built on a basic time function for each layer, which accounts for computing time and communication time simultaneously and treats both as nonlinear in the amount of input data. Focusing on jointly optimal solutions for network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove that improved multi-dimensional dichotomy (IMD) has appreciable theoretical optimality and linear computational complexity, making it significantly faster than state-of-the-art methods, including dynamic programming and recursive algorithms. Extensive experiments on both CNN-based and transformer-based neural networks demonstrate that CSIMD can obtain optimal network division and data partition schemes under MBPP. On average, the training speeds of CSIMD on CNN-based and transformer-based DNNs are $(2.0, 2.5)\times$ and $(2.66, 5.48)\times$ those of (MBPP-R, MBPP-E), respectively.
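To make the joint problem of network division and data partition concrete, the following is a minimal sketch of a micro-batch pipeline time-cost calculation. It is not the paper's time-cost model and not CSIMD or IMD: the function `layer_time`, its constants, and the exhaustive search in `brute_force` are all hypothetical, chosen only to show how a division of layers into stages and a split of the batch into micro-batches together determine the pipeline makespan. Communication time is assumed to be folded into the per-layer time function, and only equal-sized micro-batches are tried.

```python
from itertools import combinations


def layer_time(layer_cost, batch):
    """Hypothetical per-layer time: fixed overhead plus a term nonlinear in batch size."""
    return 0.1 + layer_cost * batch ** 1.2


def pipeline_makespan(stage_costs, micro_batches):
    """Forward-pass makespan of a pipeline in which stage s starts micro-batch j
    only after stage s-1 has finished j and stage s has finished j-1."""
    finish = [0.0] * len(micro_batches)  # finish times on the previous stage
    for costs in stage_costs:
        prev_on_stage = 0.0  # when this stage finished its previous micro-batch
        for j, b in enumerate(micro_batches):
            start = max(finish[j], prev_on_stage)
            prev_on_stage = start + sum(layer_time(c, b) for c in costs)
            finish[j] = prev_on_stage
    return finish[-1]


def divide(layer_costs, cuts):
    """Split the layer list into contiguous stages at the given cut indices."""
    bounds = [0, *cuts, len(layer_costs)]
    return [layer_costs[a:b] for a, b in zip(bounds, bounds[1:])]


def brute_force(layer_costs, num_stages, batch_size):
    """Exhaustively try every contiguous division and every equal micro-batch count.
    Returns (makespan, cuts, number_of_micro_batches). Illustrative only."""
    best = None
    for cuts in combinations(range(1, len(layer_costs)), num_stages - 1):
        stages = divide(layer_costs, cuts)
        for m in range(1, batch_size + 1):
            if batch_size % m:
                continue  # keep micro-batches equal-sized
            t = pipeline_makespan(stages, [batch_size // m] * m)
            if best is None or t < best[0]:
                best = (t, cuts, m)
    return best
```

Calling `brute_force([1.0, 2.0, 1.0, 2.0], 2, 8)` returns the best (makespan, cut positions, micro-batch count) triple under these assumed costs. The exhaustive search grows combinatorially with the number of layers and stages, which is the kind of cost that a faster search procedure aims to avoid.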
About the journal:
IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to:
a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms; scalability of algorithms and data structures for parallel and distributed systems; communication and synchronization protocols; network algorithms; scheduling; and load balancing.
b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems.
c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation.
d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.