{"title":"Deadline-Aware Online Job Scheduling for Distributed Training in Heterogeneous Clusters","authors":"Yuchen Zhang;Long Luo;Gang Sun;Hongfang Yu;Bo Li","doi":"10.1109/TCC.2025.3548604","DOIUrl":null,"url":null,"abstract":"The explosive growth in training data and model sizes has spurred the adoption of distributed deep learning (DL) in heterogeneous computing clusters. Efficiently scheduling distributed training jobs in such heterogeneous environments while ensuring they meet user-specified deadlines remains a critical challenge. While most existing works focus on reducing job completion time in homogeneous clusters, they pay little attention to meeting job deadlines in heterogeneous clusters. To address this issue, we propose <sc>Dancer</small> (Deadline-Aware dyNamiC GPU allocation approach for Efficient Resource utilization), a novel framework that dynamically adjusts not only the number but the type of GPUs assigned to each job throughout its training lifecycle. <sc>Dancer</small> aims to maximize the number of jobs meeting their deadlines in heterogeneous GPU clusters. It decouples job placement from resource allocation and formulates the scheduling optimization problem for maximizing the number of deadline-meeting jobs as an Integer Linear Programming (ILP) problem. To solve this ILP problem in real-time, we propose an online algorithm with a competitive ratio guarantee, leveraging primal-dual and dynamic programming techniques. Extensive trace-driven simulations based on real-world DL workloads demonstrate that <sc>Dancer</small> significantly outperforms state-of-the-art approaches, improving the deadline satisfactory ratio up to 58.9%–74.2%.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":"13 2","pages":"590-604"},"PeriodicalIF":5.3000,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cloud Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10916521/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The explosive growth in training data and model sizes has spurred the adoption of distributed deep learning (DL) in heterogeneous computing clusters. Efficiently scheduling distributed training jobs in such heterogeneous environments while ensuring they meet user-specified deadlines remains a critical challenge. Most existing works focus on reducing job completion time in homogeneous clusters and pay little attention to meeting job deadlines in heterogeneous clusters. To address this issue, we propose Dancer (Deadline-Aware dyNamiC GPU allocation approach for Efficient Resource utilization), a novel framework that dynamically adjusts not only the number but also the type of GPUs assigned to each job throughout its training lifecycle. Dancer aims to maximize the number of jobs meeting their deadlines in heterogeneous GPU clusters. It decouples job placement from resource allocation and formulates the scheduling optimization problem of maximizing the number of deadline-meeting jobs as an Integer Linear Programming (ILP) problem. To solve this ILP problem in real time, we propose an online algorithm with a competitive-ratio guarantee that leverages primal-dual and dynamic programming techniques. Extensive trace-driven simulations based on real-world DL workloads demonstrate that Dancer significantly outperforms state-of-the-art approaches, improving the deadline satisfaction ratio by up to 58.9%–74.2%.
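To make the formulation concrete, the following is a minimal sketch of what such a deadline-maximizing ILP could look like; the symbols (job set \(\mathcal{J}\), GPU types \(\mathcal{G}\), time slots \(\mathcal{T}\), per-type per-slot capacities \(C_{g,t}\), per-job throughputs \(r_{j,g}\), workloads \(W_j\), and deadlines \(d_j\)) are illustrative notation assumed here, not the model actually used in the paper.

$$
\begin{aligned}
\max_{x,\;y}\quad & \sum_{j \in \mathcal{J}} y_j \\
\text{s.t.}\quad & \sum_{g \in \mathcal{G}} \sum_{t \le d_j} r_{j,g}\, x_{j,g,t} \;\ge\; W_j\, y_j && \forall j \in \mathcal{J},\\
& \sum_{j \in \mathcal{J}} x_{j,g,t} \;\le\; C_{g,t} && \forall g \in \mathcal{G},\ t \in \mathcal{T},\\
& x_{j,g,t} \in \mathbb{Z}_{\ge 0},\quad y_j \in \{0,1\}.
\end{aligned}
$$

Here \(y_j = 1\) only if job \(j\) accumulates its full workload \(W_j\) before its deadline \(d_j\), and \(x_{j,g,t}\) counts the type-\(g\) GPUs allocated to job \(j\) in slot \(t\); the objective counts deadline-meeting jobs, while the second constraint enforces the per-type capacity of the heterogeneous cluster.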
Journal Introduction:
The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.