Latest Articles: IEEE Transactions on Parallel and Distributed Systems

CAT: Cellular Automata on Tensor Cores
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-12-20 DOI: 10.1109/TPDS.2024.3520395
Cristóbal A. Navarro;Felipe A. Quezada;Enzo Meneses;Héctor Ferrada;Nancy Hitschfeld
Cellular automata (CA) are simulation models that can produce complex emergent behaviors from simple local rules. Although state-of-the-art GPU solutions are already fast due to their data-parallel nature, their performance can rapidly degrade in CA with a large neighborhood radius. With the inclusion of tensor cores across the entire GPU ecosystem, interest has grown in finding ways to leverage these fast units outside the field of artificial intelligence, their original purpose. In this work, we present CAT, a GPU tensor core approach that can accelerate CA in which the cell transition function acts on a weighted summation of its neighborhood. CAT is evaluated theoretically, using an extended PRAM cost model, as well as empirically, using the Larger Than Life (LTL) family of CA as case studies. The results confirm that the cost model is accurate, showing that CAT exhibits constant time throughout the entire radius range 1 ≤ r ≤ 16, and that its theoretical speedups agree with the empirical results. At low radius (r = 1, 2), CAT is competitive and is surpassed only by the fastest state-of-the-art GPU solution. From r = 3 onward, CAT progressively outperforms all other approaches, reaching speedups of up to 101× over a GPU baseline and up to ~14× over the fastest state-of-the-art GPU approach. In terms of energy efficiency, CAT is competitive in the range 1 ≤ r ≤ 4, and from r ≥ 5 it is the most energy-efficient approach. As for performance scaling across GPU architectures, CAT shows a promising trend that, if it continues in future generations, would increase its performance at a higher rate than classical GPU solutions. A CPU version of CAT was also explored using the recently introduced AMX instructions. Although its performance is still below that of GPU tensor cores, it is a promising approach, as it can still outperform some GPU approaches at large radius. These results position CAT as an approach with great potential for scientists who need to study emergent phenomena in CA with a large neighborhood radius, on both the GPU and the CPU.

Vol. 36, No. 2, pp. 341–355.
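The class of CA that CAT targets, where each cell's next state depends only on a weighted sum over its radius-r neighborhood, can be illustrated with a minimal pure-Python sketch. The rule parameters below are illustrative: with r = 1, birth = (3, 3), survive = (2, 3), this reduces to Conway's Game of Life, a member of the LTL family.

```python
# One step of a Larger-Than-Life-style CA: the transition function acts on
# the (here unweighted) sum of the radius-r neighborhood, as in the CA class
# that CAT accelerates. Toroidal (wraparound) boundary conditions.

def step(grid, r, birth, survive):
    """birth/survive are (lo, hi) ranges applied to the neighborhood sum
    (cell itself excluded); grid is a square 0/1 matrix."""
    n = len(grid)
    new = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    if di == 0 and dj == 0:
                        continue
                    s += grid[(i + di) % n][(j + dj) % n]
            lo, hi = survive if grid[i][j] else birth
            new[i][j] = 1 if lo <= s <= hi else 0
    return new
```

CAT performs this same neighborhood reduction, but maps the summations onto tensor-core matrix-multiply-accumulate units instead of scalar loops, which is why its cost stays flat as r grows.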
Citations: 0
AsyncFedGAN: An Efficient and Staleness-Aware Asynchronous Federated Learning Framework for Generative Adversarial Networks
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-12-20 DOI: 10.1109/TPDS.2024.3521016
Daniel Manu;Abee Alazzwi;Jingjing Yao;Youzuo Lin;Xiang Sun
Generative Adversarial Networks (GANs) are deep learning models that learn and generate new samples similar to existing ones. Traditionally, GANs are trained in centralized data centers, raising data privacy concerns due to the need for clients to upload their data. To address this, Federated Learning (FL) integrates with GANs, allowing collaborative training without sharing local data. However, this integration is complex because GANs involve two interdependent models, the generator and the discriminator, while FL typically handles a single model over distributed datasets. In this article, we propose AsyncFedGAN, a novel asynchronous FL framework for GANs designed to efficiently and distributively train both models, tailored for molecule generation. AsyncFedGAN addresses the challenges of training interactive models, resolves the straggler issue in synchronous FL, reduces model staleness in asynchronous FL, and lowers client energy consumption. Our extensive simulations for molecular discovery show that AsyncFedGAN achieves convergence with proper settings, outperforms baseline methods, and balances model performance with client energy usage.

Vol. 36, No. 3, pp. 553–569.
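AsyncFedGAN's exact update rule is given in the paper; the general staleness-aware idea, discounting a late client update by how many global model versions it missed, can be sketched as follows. The discount function and `base_lr` here are illustrative assumptions, not the paper's formula.

```python
# Hypothetical staleness-discounted asynchronous aggregation: the server
# mixes a client's weights into the global model with a weight that decays
# as the update gets staler. AsyncFedGAN applies this idea separately to
# the generator and discriminator; this sketch shows one model's weights.

def async_update(global_w, client_w, staleness, base_lr=0.5):
    """Return the new global weights after one asynchronous client update."""
    alpha = base_lr / (1.0 + staleness)  # stale updates count for less
    return [(1 - alpha) * g + alpha * c for g, c in zip(global_w, client_w)]
```

With staleness 0 the client contributes at the full base rate; a client that trained against a model four versions old only nudges the global model a tenth of the way toward its weights.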
Citations: 0
UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-12-11 DOI: 10.1109/TPDS.2024.3515804
Guangyao Zhou;Wenhong Tian;Rajkumar Buyya;Kui Wu
The increasing need for large-scale deep neural networks (DNN) has made parallel training an area of intensive focus. One effective method, microbatch-based pipeline parallelism (notably GPipe), accelerates parallel training in various architectures. However, existing parallel training architectures normally use equal data partitioning (EDP), where each layer's process maintains identical microbatch sizes. EDP may hinder training speed because different processes often require varying optimal microbatch sizes. To address this, we introduce UMPIPE, a novel framework for unequal microbatches-based pipeline parallelism. UMPIPE enables unequal data partitions (UEDP) across processes to optimize resource utilization. We develop a recurrence formula to calculate the time cost in UMPIPE by considering both computation and communication processes. To further enhance UMPIPE's efficiency, we propose the Dual-Chromosome Genetic Algorithm for UMPIPE (DGAP) that accounts for the independent time costs of forward and backward propagation. Furthermore, we present TiDGAP, a two-level improvement on DGAP. TiDGAP accelerates the process by simultaneously calculating the end time for multiple individuals and microbatches using matrix operations. Our extensive experiments validate the dual-chromosome strategy's optimization benefits and TiDGAP's acceleration capabilities. TiDGAP can achieve better training schemes than baselines such as the local greedy algorithm and the global greedy-based dynamic programming. Compared to (GPipe, PipeDream), UMPIPE achieves increases in training speed of (13.89%, 11.09%) for GPT1-14, (17.11%, 7.96%) for VGG16, and ≥ (170%, 100%) for simulation networks.

Vol. 36, No. 2, pp. 293–307.
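The recurrence-formula idea can be illustrated with a simplified pipeline makespan computation: a microbatch can start on a stage only once it has left the previous stage and the previous microbatch has freed the current one. Real UMPIPE additionally models communication and unequal per-stage microbatch sizes; here the per-microbatch costs t[s][m] are simply taken as given.

```python
# Simplified pipeline-time recurrence: finish[s][m] is the time at which
# microbatch m leaves stage s. Unequal microbatch sizes show up as varying
# entries of t; the recurrence itself is unchanged.

def pipeline_makespan(t):
    """t[s][m] = processing time of microbatch m on stage s."""
    stages, micro = len(t), len(t[0])
    finish = [[0.0] * micro for _ in range(stages)]
    for s in range(stages):
        for m in range(micro):
            prev_stage = finish[s - 1][m] if s > 0 else 0.0  # data dependency
            prev_micro = finish[s][m - 1] if m > 0 else 0.0  # stage occupancy
            finish[s][m] = max(prev_stage, prev_micro) + t[s][m]
    return finish[-1][-1]
```

A search procedure such as UMPIPE's DGAP/TiDGAP evaluates many candidate microbatch configurations against a cost recurrence of this kind and keeps the cheapest schedule.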
Citations: 0
Fine-Grained QoS Control via Tightly-Coupled Bandwidth Monitoring and Regulation for FPGA-Based Heterogeneous SoCs
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-12-09 DOI: 10.1109/TPDS.2024.3513416
Giacomo Valente;Gianluca Brilli;Tania Di Mascio;Alessandro Capotondi;Paolo Burgio;Paolo Valente;Andrea Marongiu
Commercial embedded systems increasingly rely on heterogeneous architectures that integrate general-purpose multi-core processors and various hardware accelerators on the same chip. This provides the high performance required by modern applications at low cost and low power consumption, but at the same time poses new challenges. Hardware resource sharing at various levels, and in particular at the main memory controller level, results in slower execution times for application tasks, ultimately making the system unpredictable from the point of view of timing. To enable the adoption of heterogeneous systems-on-chip (SoCs) in the domain of timing-critical applications, several hardware and software approaches have been proposed, bandwidth regulation based on monitoring and throttling being one of the most widely adopted. Existing solutions, however, are either too coarse-grained, limiting control over the activities of the compute engines, or strongly platform-dependent, addressing the problem only for specific SoCs. This article proposes an innovative approach that can accurately control main memory bandwidth usage in FPGA-based heterogeneous SoCs. In particular, it controls system bandwidth by connecting a runtime bandwidth regulation component to FPGA-based accelerators. Our solution offers dynamically configurable, fine-grained bandwidth regulation, able to adapt to the varying requirements of the application over time, at very low overhead. Furthermore, it is entirely platform-independent and capable of integration with any FPGA-based accelerator. Developed at the register-transfer level using a reference SoC platform, it is designed for easy compatibility with any FPGA-based SoC. Experimental results on the Xilinx Zynq UltraScale+ platform demonstrate that our approach (i) is more than 100× faster than loosely-coupled, software-controlled regulators; (ii) exploits the system bandwidth 28.7% more efficiently than tightly-coupled hardware regulators (e.g., the ARM CoreLink QoS-400, where available); and (iii) enables task co-scheduling solutions not feasible with state-of-the-art bandwidth regulation methods.

Vol. 36, No. 2, pp. 326–340.
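The monitor-and-throttle mechanism behind bandwidth regulation is essentially credit-based: an accelerator may issue a memory transaction only if enough credit has accumulated at the configured rate. The paper implements this at the register-transfer level with far finer granularity; this is a software sketch of the logic only, and all numbers are illustrative.

```python
# Illustrative credit-based (token-bucket) bandwidth regulator: credit
# accumulates each cycle up to a burst cap, and a memory transaction is
# admitted only if its size fits within the current credit.

class BandwidthRegulator:
    def __init__(self, bytes_per_cycle, burst_bytes):
        self.rate = bytes_per_cycle   # replenish rate = allowed bandwidth
        self.cap = burst_bytes        # maximum accumulated credit
        self.credit = burst_bytes

    def tick(self):
        """Advance one cycle, replenishing credit up to the cap."""
        self.credit = min(self.cap, self.credit + self.rate)

    def try_issue(self, nbytes):
        """Admit the transaction only if credit covers it."""
        if self.credit >= nbytes:
            self.credit -= nbytes
            return True
        return False
```

Making `rate` and `cap` writable at runtime is what gives the dynamically configurable regulation the abstract describes; a tightly-coupled hardware version applies the same admission test per transaction without software in the loop.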
Citations: 0
TOP: Task-Based Operator Parallelism for Asynchronous Deep Learning Inference on GPU
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-12-05 DOI: 10.1109/TPDS.2024.3511543
Changyao Lin;Zhenming Chen;Ziyang Zhang;Jie Liu
Current deep learning compilers have made significant strides in optimizing computation graphs for single- and multi-model scenarios. However, they lack specific optimizations for asynchronous multi-task inference systems. In such systems, tasks arrive dynamically, leading to diverse inference progress for each model. This renders traditional optimization strategies based solely on the original computation graph suboptimal or even invalid. Furthermore, existing operator scheduling methods do not account for parallel task pipelines involving the same model. Task pipelines present additional opportunities for optimization. Therefore, we propose Task-based Operator Parallelism (TOP). TOP incorporates an understanding of the impact of task arrival patterns on the inference progress of each model. It leverages the multi-agent reinforcement learning algorithm MADDPG to cooperatively optimize the task launcher and model scheduler, generating an optimal pair of dequeue frequency and computation graph. The objective of TOP is to enhance resource utilization, increase throughput, and allocate resources judiciously to prevent task backlog. To expedite the optimization process in TOP, we introduce a novel stage partition method using the GNN-based Policy Gradient (GPG) algorithm. Through extensive experiments on various devices, we demonstrate the efficacy of TOP. It outperforms the state of the art in operator scheduling for both single- and multi-model task processing scenarios. Benefiting from TOP, we can significantly enhance the throughput of a single model by increasing its concurrency or batch size, thereby achieving self-acceleration.

Vol. 36, No. 2, pp. 266–281.
Citations: 0
An Efficient GPU Algorithm for Lattice Boltzmann Method on Sparse Complex Geometries
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-12-04 DOI: 10.1109/TPDS.2024.3510810
Zhangrong Qin;Xusheng Lu;Long Lv;Zhongxiang Tang;Binghai Wen
Many fluid flow problems, such as porous media, arterial blood flow, and tissue fluid, involve sparse complex geometries. Although the lattice Boltzmann method is good at dealing with complex boundaries, these sparse complex geometries cause low computational performance and high memory consumption when the graphics processing unit (GPU) is used to accelerate the numerical computation. These problems can be addressed by a compact memory layout, sophisticated memory access, and enhanced thread utilization. This paper proposes a GPU-based algorithm to improve lattice Boltzmann simulations with sparse complex geometries. An access pattern for a single set of distribution functions, together with semi-direct addressing, is adopted to reduce memory consumption, while a collected structure of arrays is employed to enhance memory access efficiency. Furthermore, an address index array and a node classification coding scheme are employed to improve the GPU thread utilization ratio and to reduce GPU global memory accesses, respectively. Accuracy and mesh independence have been verified by numerical simulations of Poiseuille flow and porous media flow with face-centered filled spheres. The present algorithm has significantly lower memory consumption than those based on direct or indirect addressing schemes, and it improves computational performance severalfold compared to the other algorithms on common GPU hardware.

Vol. 36, No. 2, pp. 239–252.
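The addressing idea can be sketched in a few lines: store per-node data (the distribution functions) only for fluid nodes, plus a dense index array mapping every lattice cell to its compacted slot, or -1 for solid. This is a simplification of the paper's semi-direct scheme, which also collects arrays and classifies node types, but it shows why memory scales with the fluid fraction rather than the full grid.

```python
# Build a dense -> sparse index for a lattice with solid obstacles: fluid
# nodes get consecutive slots in compacted storage, solid cells get -1.
# Lookups stay O(1) (direct addressing into the map), while distribution
# functions are allocated only for the returned slot count.

def build_index(mask):
    """mask[i][j] is True where the cell is fluid. Returns (index, slots)."""
    index, slots = [], 0
    for row in mask:
        index_row = []
        for is_fluid in row:
            index_row.append(slots if is_fluid else -1)
            slots += 1 if is_fluid else 0
        index.append(index_row)
    return index, slots
```

For a porous geometry that is, say, 10% fluid, the distribution-function arrays shrink by roughly 10×, at the cost of one extra integer lookup per cell, which is the trade the paper's addressing scheme optimizes.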
Citations: 0
Object Proxy Patterns for Accelerating Distributed Applications
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-12-04 DOI: 10.1109/TPDS.2024.3511347
J. Gregory Pauloski;Valerie Hayot-Sasson;Logan Ward;Alexander Brace;André Bauer;Kyle Chard;Ian Foster
Workflow and serverless frameworks have empowered new approaches to distributed application design by abstracting compute resources. However, their typically limited or one-size-fits-all support for advanced data flow patterns leaves optimization to the application programmer, optimization that becomes more difficult as data become larger. The transparent object proxy, which provides wide-area references that can resolve to data regardless of location, has been demonstrated as an effective low-level building block in such situations. Here we propose three high-level proxy-based programming patterns (distributed futures, streaming, and ownership) that make the power of the proxy pattern usable for more complex and dynamic distributed program structures. We motivate these patterns via careful review of application requirements and describe implementations of each pattern. We evaluate our implementations through a suite of benchmarks and by applying them in three meaningful scientific applications, in which we demonstrate substantial improvements in runtime, throughput, and memory usage.

Vol. 36, No. 2, pp. 253–265.
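The transparent object proxy the patterns build on reduces to a small sketch: the proxy holds a factory that can produce the target (in the wide-area setting, a fetch from an object store) and resolves it lazily on first use, after which attribute access is forwarded. This toy omits serialization, stores, and the three proposed patterns.

```python
# Minimal transparent proxy: resolution is deferred until the first
# attribute access, then cached, so an unresolved proxy is cheap to pass
# around a distributed application.

class Proxy:
    def __init__(self, factory):
        self._factory = factory   # callable that produces/fetches the target
        self._target = None

    def _resolve(self):
        if self._target is None:
            self._target = self._factory()  # resolve exactly once
        return self._target

    def __getattr__(self, name):
        # Called only for attributes not found on the proxy itself,
        # so _factory/_target lookups above do not recurse here.
        return getattr(self._resolve(), name)
```

A distributed future is then a proxy whose factory blocks on a not-yet-computed result; streaming and ownership layer further policy onto when and where resolution happens.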
Citations: 0
Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-11-28 DOI: 10.1109/TPDS.2024.3507814
Zhongyi Lin;Ning Sun;Pallab Bhattacharya;Xizhou Feng;Louis Feng;John D. Owens
Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter- and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of deep learning recommendation models (DLRM) with random configurations, with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based natural language processing (NLP) models, with a geomean error of 3.00%. Moreover, even without actually running ML workloads like DLRMs on the hardware, it is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%).

Vol. 36, No. 2, pp. 226–238.
Citations: 0
Slark: A Performance Robust Decentralized Inter-Datacenter Deadline-Aware Coflows Scheduling Framework With Local Information
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-11-28 DOI: 10.1109/TPDS.2024.3508275
Xiaodong Dong;Lihai Nie;Zheli Liu;Yang Xiang
Inter-datacenter network applications generate massive coflows, for purposes such as backup, synchronization, and analytics, with deadline requirements. Decentralized coflow scheduling frameworks are desirable for their scalability in cross-domain deployment but grapple with the challenge of information agnosticism for lack of cross-domain privileges. Current information-agnostic coflow scheduling methods are incompatible with decentralized frameworks because they rely on centralized controllers to continuously monitor and learn from global coflow transmission states to infer global coflow information. Alternative methods propose mechanisms for decentralized global coflow information gathering and synchronization. However, they require dedicated physical hardware or control logic, which can be impractical for incremental deployment. This article proposes Slark, a decentralized deadline-aware coflow scheduling framework that meets coflows' soft and hard deadline requirements using only local traffic information. It eschews global coflow transmission states and dedicated hardware or control logic by leveraging multiple software-implemented scheduling agents working independently on each node, and it integrates this information agnosticism into node-specific bandwidth allocation by modeling it as a robust optimization problem with flow information on the other nodes represented as uncertain parameters. We then validate the performance robustness of Slark by investigating how perturbations in the optimal objective function value and the associated optimal solution are affected by the uncertain parameters. Finally, we propose a firebug-swarm-optimization-based heuristic algorithm to tackle the non-convexity of the problem. Experimental results demonstrate that Slark significantly enhances transmission revenue and increases the soft and hard deadline guarantee ratios by 10.52% and 7.99% on average.

Vol. 36, No. 2, pp. 197–211.
Citations: 0
Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs Over Heterogeneous Infrastructure
IF 5.6 · Tier 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-11-27 DOI: 10.1109/TPDS.2024.3506588
Zhi Ling;Xiaofeng Jiang;Xiaobin Tan;Huasen He;Shiyin Zhu;Jian Yang
Distributed training of deep neural networks (DNNs) suffers from efficiency declines in dynamic heterogeneous environments, due to the resource wastage brought by the straggler problem in data parallelism (DP) and pipeline bubbles in model parallelism (MP). Additionally, limited resource availability requires a trade-off between training performance and long-term costs, particularly in online settings. To address these challenges, this article presents a novel online approach to maximize long-term training efficiency in heterogeneous environments through uneven data assignment and communication-aware model partitioning. A group-based hierarchical architecture combining DP and MP is developed to balance discrepant computation and communication capabilities and to offer a flexible parallel mechanism. To jointly optimize the performance and long-term cost of the online DL training process, we formulate the problem as a stochastic optimization with time-averaged constraints. By utilizing Lyapunov's stochastic network optimization theory, we decompose it into several instantaneous sub-optimizations and devise an effective online solution based on tentative searching and linear solving. We have implemented a prototype system and evaluated the effectiveness of our solution in realistic experiments, reducing batch training time by up to 68.59% compared to state-of-the-art methods.

Vol. 36, No. 2, pp. 150–167.
Citations: 0