Non-Clairvoyant Scheduling of Distributed Machine Learning With Inter-Job and Intra-Job Parallelism on Heterogeneous GPUs

IF 5 | Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Cloud Computing Pub Date : 2024-06-14 DOI:10.1109/TCC.2024.3414440

Fahao Chen;Peng Li;Celimuge Wu;Song Guo

IEEE Transactions on Cloud Computing, vol. 12, no. 4, pp. 1011-1025. https://ieeexplore.ieee.org/document/10557720/

Citations: 0

Abstract

Distributed machine learning (DML) has shown great promise in accelerating model training on multiple GPUs. To increase GPU utilization, a common practice is to let multiple learning jobs share a GPU cluster, where the most fundamental and critical challenge is how to schedule these jobs on GPUs efficiently. However, existing work on DML job scheduling is limited to settings with homogeneous GPUs. GPU heterogeneity is common in practice, but its influence on scheduling multiple DML jobs has seldom been studied. Moreover, DML jobs have internal structures with great potential for parallelism, which has not yet been fully exploited in heterogeneous computing environments. In this paper, we propose Hare, a DML job scheduler that exploits both inter-job and intra-job parallelism in a heterogeneous GPU cluster. Hare adopts a relaxed fixed-scale synchronization scheme that allows independent tasks to be flexibly scheduled within a training round. Given full knowledge of job arrival times and sizes, we propose a fast heuristic algorithm to minimize the average job completion time and derive its theoretical bound. Without prior knowledge of jobs, we propose an online algorithm based on the Heterogeneity-aware Least-Attained Service (HLAS) policy. We evaluate Hare using a small-scale testbed and a trace-driven simulator. The results show that it can outperform the state of the art, achieving a performance improvement of about 2.94×.
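
The abstract names the HLAS policy but does not describe how it works, so the following is only an illustrative sketch of a generic least-attained-service rule on GPUs of different speeds, not the authors' algorithm. Everything here is an assumption made for illustration: the `Job` class, the `las_schedule` function, unit time steps, one GPU per job per step (so intra-job parallelism is ignored), and the choice to measure attained service in speed-weighted work units. The scheduler never reads a job's size when ordering jobs, which is what makes the rule non-clairvoyant; the size is used only by the simulator to detect completion.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    arrival: int                  # time step at which the job arrives
    size: float                   # total work; hidden from the scheduling rule
    attained: float = 0.0         # work received so far (GPU time x GPU speed)
    finish: Optional[int] = None  # completion time, set by the simulator


def las_schedule(jobs: List[Job], gpu_speeds: List[float],
                 max_steps: int = 10_000) -> Dict[str, Optional[int]]:
    """Simulate a least-attained-service rule in unit time steps.

    Hypothetical sketch: in each step, the job with the least attained
    service gets the fastest GPU, the next job the next-fastest, and so on.
    """
    speeds = sorted(gpu_speeds, reverse=True)
    for t in range(max_steps):
        if all(j.finish is not None for j in jobs):
            break
        active = [j for j in jobs if j.arrival <= t and j.finish is None]
        # Order by attained service only; ties broken by arrival, then name.
        active.sort(key=lambda j: (j.attained, j.arrival, j.name))
        for job, speed in zip(active, speeds):
            job.attained += speed
            if job.attained >= job.size:  # simulator-side completion check
                job.finish = t + 1
    return {j.name: j.finish for j in jobs}


if __name__ == "__main__":
    demo = [Job("A", 0, 4.0), Job("B", 0, 2.0)]
    print(las_schedule(demo, gpu_speeds=[2.0, 1.0]))
```

In this toy run the shorter job B falls behind in attained service after the first step, is then given the faster GPU, and finishes first, which is the behavior a least-attained-service rule aims for when job sizes are unknown.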


Source journal

IEEE Transactions on Cloud Computing Computer Science-Software

CiteScore

9.40

Self-citation rate

6.20%

Articles published

167

About the journal: The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.