{"title":"Deadline-Aware Online Job Scheduling for Distributed Training in Heterogeneous Clusters","authors":"Yuchen Zhang;Long Luo;Gang Sun;Hongfang Yu;Bo Li","doi":"10.1109/TCC.2025.3548604","DOIUrl":null,"url":null,"abstract":"The explosive growth in training data and model sizes has spurred the adoption of distributed deep learning (DL) in heterogeneous computing clusters. Efficiently scheduling distributed training jobs in such heterogeneous environments while ensuring they meet user-specified deadlines remains a critical challenge. While most existing works focus on reducing job completion time in homogeneous clusters, they pay little attention to meeting job deadlines in heterogeneous clusters. To address this issue, we propose <sc>Dancer</small> (Deadline-Aware dyNamiC GPU allocation approach for Efficient Resource utilization), a novel framework that dynamically adjusts not only the number but the type of GPUs assigned to each job throughout its training lifecycle. <sc>Dancer</small> aims to maximize the number of jobs meeting their deadlines in heterogeneous GPU clusters. It decouples job placement from resource allocation and formulates the scheduling optimization problem for maximizing the number of deadline-meeting jobs as an Integer Linear Programming (ILP) problem. To solve this ILP problem in real-time, we propose an online algorithm with a competitive ratio guarantee, leveraging primal-dual and dynamic programming techniques. Extensive trace-driven simulations based on real-world DL workloads demonstrate that <sc>Dancer</small> significantly outperforms state-of-the-art approaches, improving the deadline satisfactory ratio up to 58.9%–74.2%.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":"13 2","pages":"590-604"},"PeriodicalIF":5.3000,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cloud Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10916521/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The explosive growth in training data and model sizes has spurred the adoption of distributed deep learning (DL) in heterogeneous computing clusters. Efficiently scheduling distributed training jobs in such heterogeneous environments while ensuring they meet user-specified deadlines remains a critical challenge. Most existing works focus on reducing job completion time in homogeneous clusters and pay little attention to meeting job deadlines in heterogeneous clusters. To address this issue, we propose Dancer (Deadline-Aware dyNamiC GPU allocation approach for Efficient Resource utilization), a novel framework that dynamically adjusts not only the number but also the type of GPUs assigned to each job throughout its training lifecycle. Dancer aims to maximize the number of jobs meeting their deadlines in heterogeneous GPU clusters. It decouples job placement from resource allocation and formulates the scheduling optimization problem of maximizing the number of deadline-meeting jobs as an Integer Linear Programming (ILP) problem. To solve this ILP problem in real time, we propose an online algorithm with a competitive-ratio guarantee that leverages primal-dual and dynamic programming techniques. Extensive trace-driven simulations based on real-world DL workloads demonstrate that Dancer significantly outperforms state-of-the-art approaches, improving the deadline satisfaction ratio by up to 58.9%–74.2%.
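To make the formulation concrete, the following is a minimal sketch of what such a deadline-maximizing ILP could look like; the symbols (job set \(\mathcal{J}\), GPU types \(\mathcal{G}\), time slots \(\mathcal{T}\), per-type per-slot capacities \(C_{g,t}\), per-job throughputs \(r_{j,g}\), workloads \(W_j\), and deadlines \(d_j\)) are illustrative notation assumed here, not the model actually used in the paper.

$$
\begin{aligned}
\max_{x,\;y}\quad & \sum_{j \in \mathcal{J}} y_j \\
\text{s.t.}\quad & \sum_{g \in \mathcal{G}} \sum_{t \le d_j} r_{j,g}\, x_{j,g,t} \;\ge\; W_j\, y_j && \forall j \in \mathcal{J},\\
& \sum_{j \in \mathcal{J}} x_{j,g,t} \;\le\; C_{g,t} && \forall g \in \mathcal{G},\ t \in \mathcal{T},\\
& x_{j,g,t} \in \mathbb{Z}_{\ge 0},\quad y_j \in \{0,1\}.
\end{aligned}
$$

Here \(y_j = 1\) only if job \(j\) accumulates its full workload \(W_j\) before its deadline \(d_j\), and \(x_{j,g,t}\) counts the type-\(g\) GPUs allocated to job \(j\) in slot \(t\); the objective counts deadline-meeting jobs, while the second constraint enforces the per-type capacity of the heterogeneous cluster.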
Journal Introduction:
The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.