Hajer Ayadi, Aijun An, Yiming Shao, Hossein Pourmedheji, Junjie Deng, Jimmy X. Huang, Michael Feiman, Hao Zhou
{"title":"Topology-aware GPU job scheduling with deep reinforcement learning and heuristics","authors":"Hajer Ayadi , Aijun An , Yiming Shao , Hossein Pourmedheji , Junjie Deng , Jimmy X. Huang , Michael Feiman , Hao Zhou","doi":"10.1016/j.jpdc.2025.105138","DOIUrl":null,"url":null,"abstract":"<div><div>Deep neural networks (DNNs) have gained popularity in many fields such as computer vision, and natural language processing. However, the increasing size of data and complexity of models have made training DNNs time-consuming. While distributed DNN training using multiple GPUs in parallel is a common solution, it introduces challenges in GPU resource management and scheduling. One key challenge is minimizing communication costs among GPUs assigned to a DNN training job. High communication costs—arising from factors such as inter-rack or inter-machine data transfers—can lead to hardware bottlenecks and network delays, ultimately slowing down training. Reducing these costs facilitates more efficient data transfer and synchronization, directly accelerating the training process. Although deep reinforcement learning (DRL) has shown promise in GPU resource scheduling, existing methods often lack considerations for hardware topology. Moreover, most proposed GPU schedulers ignore the possibility of combining heuristic and DRL policies. In response to these challenges, we introduce <span><math><mi>T</mi><mi>o</mi><mi>p</mi><mi>D</mi><mi>R</mi><mi>L</mi></math></span>, an innovative hybrid scheduler that integrates deep reinforcement learning (DRL) and heuristic methods to enhance GPU job scheduling. <span><math><mi>T</mi><mi>o</mi><mi>p</mi><mi>D</mi><mi>R</mi><mi>L</mi></math></span> uses a multi-branch convolutional neural network (CNN) model for job selection and a heuristic method for GPU allocation. At each time step, the CNN model selects a job, and then a heuristic method selects available GPUs closest to each other from the cluster. Reinforcement learning (RL) is used to train the CNN model to select the job that maximizes throughput-based rewards. Extensive evaluation, conducted on datasets with real jobs, shows that <span><math><mi>T</mi><mi>o</mi><mi>p</mi><mi>D</mi><mi>R</mi><mi>L</mi></math></span> significantly outperforms six baseline schedulers that use heuristics or other DRL models for job picking and resource allocation.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105138"},"PeriodicalIF":3.4000,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Parallel and Distributed Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0743731525001054","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0
Abstract
Deep neural networks (DNNs) have gained popularity in many fields such as computer vision and natural language processing. However, the increasing size of data and complexity of models have made training DNNs time-consuming. While distributed DNN training using multiple GPUs in parallel is a common solution, it introduces challenges in GPU resource management and scheduling. One key challenge is minimizing communication costs among the GPUs assigned to a DNN training job. High communication costs, arising from factors such as inter-rack or inter-machine data transfers, can lead to hardware bottlenecks and network delays, ultimately slowing down training. Reducing these costs facilitates more efficient data transfer and synchronization, directly accelerating the training process. Although deep reinforcement learning (DRL) has shown promise in GPU resource scheduling, existing methods often lack considerations for hardware topology. Moreover, most proposed GPU schedulers ignore the possibility of combining heuristic and DRL policies. In response to these challenges, we introduce TopDRL, an innovative hybrid scheduler that integrates DRL and heuristic methods to enhance GPU job scheduling. TopDRL uses a multi-branch convolutional neural network (CNN) model for job selection and a heuristic method for GPU allocation. At each time step, the CNN model selects a job, and a heuristic method then selects available GPUs that are closest to each other in the cluster. Reinforcement learning (RL) is used to train the CNN model to select the job that maximizes throughput-based rewards. Extensive evaluation, conducted on datasets with real jobs, shows that TopDRL significantly outperforms six baseline schedulers that use heuristics or other DRL models for job picking and resource allocation.
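To make the two-stage decision described above concrete, the following is a minimal sketch of one scheduling step under simplified assumptions: the cluster model (machines grouped into racks), the distance metric, and the names Job, GPU, select_job, and pick_closest_gpus are illustrative stand-ins, not the paper's actual implementation, and the learned CNN policy is replaced here by a trivial rule.

```python
# Sketch of one TopDRL-style scheduling step (illustrative only):
# a policy picks a job, then a topology-aware heuristic picks the
# set of free GPUs that are closest to each other.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class GPU:
    gpu_id: int
    machine: int          # machine index
    rack: int             # rack index
    free: bool = True

@dataclass
class Job:
    job_id: int
    gpus_needed: int

def distance(a: GPU, b: GPU) -> int:
    """Assumed topology distance: 0 = same machine, 1 = same rack, 2 = cross-rack."""
    if a.machine == b.machine:
        return 0
    if a.rack == b.rack:
        return 1
    return 2

def pick_closest_gpus(cluster: list[GPU], k: int) -> list[GPU] | None:
    """Heuristic allocation: among free GPUs, choose the k-subset with the
    smallest total pairwise topology distance (brute force, small clusters only)."""
    free = [g for g in cluster if g.free]
    if len(free) < k:
        return None
    best, best_cost = None, float("inf")
    for subset in combinations(free, k):
        cost = sum(distance(a, b) for a, b in combinations(subset, 2))
        if cost < best_cost:
            best, best_cost = list(subset), cost
    return best

def select_job(pending: list[Job]) -> Job:
    """Stand-in for the learned policy: pick the smallest pending job.
    In TopDRL this choice is made by a multi-branch CNN trained with RL
    to maximize throughput-based rewards."""
    return min(pending, key=lambda j: j.gpus_needed)

# One scheduling step on a toy 16-GPU cluster (4 GPUs per machine, 2 machines per rack).
cluster = [GPU(i, machine=i // 4, rack=i // 8) for i in range(16)]
pending = [Job(0, gpus_needed=4), Job(1, gpus_needed=2)]

job = select_job(pending)
allocation = pick_closest_gpus(cluster, job.gpus_needed)
if allocation is not None:
    for g in allocation:
        g.free = False
    print(f"Job {job.job_id} -> GPUs {[g.gpu_id for g in allocation]}")
```

In this toy setup the heuristic packs the job onto GPUs sharing a machine whenever possible, falling back to a single rack before spilling across racks, which mirrors the goal of minimizing inter-machine and inter-rack communication.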
Journal description:
This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing.
The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics; again covering the full range from the design to the use of our targeted systems.