Hajer Ayadi, Aijun An, Yiming Shao, Hossein Pourmedheji, Junjie Deng, Jimmy X. Huang, Michael Feiman, Hao Zhou
{"title":"Topology-aware GPU job scheduling with deep reinforcement learning and heuristics","authors":"Hajer Ayadi , Aijun An , Yiming Shao , Hossein Pourmedheji , Junjie Deng , Jimmy X. Huang , Michael Feiman , Hao Zhou","doi":"10.1016/j.jpdc.2025.105138","DOIUrl":null,"url":null,"abstract":"<div><div>Deep neural networks (DNNs) have gained popularity in many fields such as computer vision, and natural language processing. However, the increasing size of data and complexity of models have made training DNNs time-consuming. While distributed DNN training using multiple GPUs in parallel is a common solution, it introduces challenges in GPU resource management and scheduling. One key challenge is minimizing communication costs among GPUs assigned to a DNN training job. High communication costs—arising from factors such as inter-rack or inter-machine data transfers—can lead to hardware bottlenecks and network delays, ultimately slowing down training. Reducing these costs facilitates more efficient data transfer and synchronization, directly accelerating the training process. Although deep reinforcement learning (DRL) has shown promise in GPU resource scheduling, existing methods often lack considerations for hardware topology. Moreover, most proposed GPU schedulers ignore the possibility of combining heuristic and DRL policies. In response to these challenges, we introduce <span><math><mi>T</mi><mi>o</mi><mi>p</mi><mi>D</mi><mi>R</mi><mi>L</mi></math></span>, an innovative hybrid scheduler that integrates deep reinforcement learning (DRL) and heuristic methods to enhance GPU job scheduling. <span><math><mi>T</mi><mi>o</mi><mi>p</mi><mi>D</mi><mi>R</mi><mi>L</mi></math></span> uses a multi-branch convolutional neural network (CNN) model for job selection and a heuristic method for GPU allocation. At each time step, the CNN model selects a job, and then a heuristic method selects available GPUs closest to each other from the cluster. Reinforcement learning (RL) is used to train the CNN model to select the job that maximizes throughput-based rewards. Extensive evaluation, conducted on datasets with real jobs, shows that <span><math><mi>T</mi><mi>o</mi><mi>p</mi><mi>D</mi><mi>R</mi><mi>L</mi></math></span> significantly outperforms six baseline schedulers that use heuristics or other DRL models for job picking and resource allocation.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105138"},"PeriodicalIF":3.4000,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Parallel and Distributed Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0743731525001054","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0
Abstract
Deep neural networks (DNNs) have gained popularity in many fields such as computer vision and natural language processing. However, the increasing size of data and complexity of models have made training DNNs time-consuming. While distributed DNN training using multiple GPUs in parallel is a common solution, it introduces challenges in GPU resource management and scheduling. One key challenge is minimizing communication costs among the GPUs assigned to a DNN training job. High communication costs, arising from factors such as inter-rack or inter-machine data transfers, can lead to hardware bottlenecks and network delays, ultimately slowing down training. Reducing these costs facilitates more efficient data transfer and synchronization, directly accelerating the training process. Although deep reinforcement learning (DRL) has shown promise in GPU resource scheduling, existing methods often lack considerations for hardware topology. Moreover, most proposed GPU schedulers ignore the possibility of combining heuristic and DRL policies. In response to these challenges, we introduce TopDRL, an innovative hybrid scheduler that integrates DRL and heuristic methods to enhance GPU job scheduling. TopDRL uses a multi-branch convolutional neural network (CNN) model for job selection and a heuristic method for GPU allocation. At each time step, the CNN model selects a job, and a heuristic method then selects available GPUs that are closest to each other in the cluster. Reinforcement learning (RL) is used to train the CNN model to select the job that maximizes throughput-based rewards. Extensive evaluation, conducted on datasets with real jobs, shows that TopDRL significantly outperforms six baseline schedulers that use heuristics or other DRL models for job picking and resource allocation.
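To make the two-stage decision described above concrete, the following is a minimal sketch of one scheduling step under simplified assumptions: the cluster model (machines grouped into racks), the distance metric, and the names Job, GPU, select_job, and pick_closest_gpus are illustrative stand-ins, not the paper's actual implementation, and the learned CNN policy is replaced here by a trivial rule.

```python
# Sketch of one TopDRL-style scheduling step (illustrative only):
# a policy picks a job, then a topology-aware heuristic picks the
# set of free GPUs that are closest to each other.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class GPU:
    gpu_id: int
    machine: int          # machine index
    rack: int             # rack index
    free: bool = True

@dataclass
class Job:
    job_id: int
    gpus_needed: int

def distance(a: GPU, b: GPU) -> int:
    """Assumed topology distance: 0 = same machine, 1 = same rack, 2 = cross-rack."""
    if a.machine == b.machine:
        return 0
    if a.rack == b.rack:
        return 1
    return 2

def pick_closest_gpus(cluster: list[GPU], k: int) -> list[GPU] | None:
    """Heuristic allocation: among free GPUs, choose the k-subset with the
    smallest total pairwise topology distance (brute force, small clusters only)."""
    free = [g for g in cluster if g.free]
    if len(free) < k:
        return None
    best, best_cost = None, float("inf")
    for subset in combinations(free, k):
        cost = sum(distance(a, b) for a, b in combinations(subset, 2))
        if cost < best_cost:
            best, best_cost = list(subset), cost
    return best

def select_job(pending: list[Job]) -> Job:
    """Stand-in for the learned policy: pick the smallest pending job.
    In TopDRL this choice is made by a multi-branch CNN trained with RL
    to maximize throughput-based rewards."""
    return min(pending, key=lambda j: j.gpus_needed)

# One scheduling step on a toy 16-GPU cluster (4 GPUs per machine, 2 machines per rack).
cluster = [GPU(i, machine=i // 4, rack=i // 8) for i in range(16)]
pending = [Job(0, gpus_needed=4), Job(1, gpus_needed=2)]

job = select_job(pending)
allocation = pick_closest_gpus(cluster, job.gpus_needed)
if allocation is not None:
    for g in allocation:
        g.free = False
    print(f"Job {job.job_id} -> GPUs {[g.gpu_id for g in allocation]}")
```

In this toy setup the heuristic packs the job onto GPUs sharing a machine whenever possible, falling back to a single rack before spilling across racks, which mirrors the goal of minimizing inter-machine and inter-rack communication.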
Journal description:
This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing.
The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics; again covering the full range from the design to the use of our targeted systems.