A Survey on the Scheduling of DL and LLM Training Jobs in GPU Clusters

Impact Factor: 1.6 | CAS Zone 4 (Computer Science) | JCR Q3 (Engineering, Electrical & Electronic)
Tianhao Fu;Zehua Yang;Zhisheng Ye;Chenxiang Ma;Yang Han;Yingwei Luo;Xiaolin Wang;Zhenlin Wang
{"title":"A Survey on the Scheduling of DL and LLM Training Jobs in GPU Clusters","authors":"Tianhao Fu;Zehua Yang;Zhisheng Ye;Chenxiang Ma;Yang Han;Yingwei Luo;Xiaolin Wang;Zhenlin Wang","doi":"10.23919/cje.2024.00.070","DOIUrl":null,"url":null,"abstract":"As deep learning (DL) technology rapidly advances in areas such as computer vision, natural language processing, and more recently, large language models (LLMs), the demand for computing resources has increasingly grown. In particular, scheduling deep learning training (DLT) jobs on graphics processing unit (GPU) clusters has become crucial for the effective utilization of computing resources and the acceleration of model training processes. However, resource management and scheduling in GPU clusters face challenges related to computing and communication, including job sharing, interference, elastic scheduling, heterogeneous resources, and fairness. This survey investigates the scheduling issues of DLT jobs in GPU clusters, focusing on scheduling optimizations at the job characteristic and cluster resource levels. We analyze the structure and training computing characteristics of traditional DL models and LLMs, as well as their requirements for iterative computation, communication, GPU sharing, and resource elasticity. In addition, we compare the main contributions of this survey with related reviews and discuss research directions, including scheduling based on job characteristics and optimization strategies for cluster resources. This survey aims to provide researchers and practitioners with a comprehensive understanding of DLT job scheduling in GPU clusters and to point out directions for future research.","PeriodicalId":50701,"journal":{"name":"Chinese Journal of Electronics","volume":"34 3","pages":"881-905"},"PeriodicalIF":1.6000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11060018","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Chinese Journal of Electronics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11060018/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

As deep learning (DL) technology rapidly advances in areas such as computer vision, natural language processing, and, more recently, large language models (LLMs), the demand for computing resources continues to grow. In particular, scheduling deep learning training (DLT) jobs on graphics processing unit (GPU) clusters has become crucial for effectively utilizing computing resources and accelerating model training. However, resource management and scheduling in GPU clusters face computing and communication challenges, including job sharing, interference, elastic scheduling, heterogeneous resources, and fairness. This survey investigates the scheduling of DLT jobs in GPU clusters, focusing on optimizations at both the job-characteristic and cluster-resource levels. We analyze the structure and training characteristics of traditional DL models and LLMs, as well as their requirements for iterative computation, communication, GPU sharing, and resource elasticity. In addition, we compare the main contributions of this survey with related reviews and discuss research directions, including scheduling based on job characteristics and optimization strategies for cluster resources. This survey aims to give researchers and practitioners a comprehensive understanding of DLT job scheduling in GPU clusters and to point out directions for future research.
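To make two of the abstract's themes concrete, below is a minimal, illustrative Python sketch (not taken from the survey): a toy single-round GPU allocator that orders jobs by least attained service as a simple fairness heuristic and shrinks elastic jobs when their full GPU request cannot be met. The job names, GPU counts, and the DLTJob/schedule_round helpers are all hypothetical assumptions for illustration.

from dataclasses import dataclass

@dataclass
class DLTJob:
    name: str
    gpus_requested: int      # preferred data-parallel width
    min_gpus: int            # elastic lower bound the job can still run with
    attained_service: float  # GPU-hours received so far

def schedule_round(jobs, free_gpus):
    """One scheduling round: order jobs by least attained service (a simple
    fairness heuristic), then allocate GPUs, shrinking elastic jobs when the
    cluster cannot satisfy their full request."""
    allocation = {}
    for job in sorted(jobs, key=lambda j: j.attained_service):
        if free_gpus >= job.gpus_requested:
            grant = job.gpus_requested   # full allocation
        elif free_gpus >= job.min_gpus:
            grant = free_gpus            # elastic: run with fewer GPUs
        else:
            continue                     # too few GPUs left; wait for next round
        allocation[job.name] = grant
        free_gpus -= grant
    return allocation

if __name__ == "__main__":
    queue = [
        DLTJob("resnet50",  gpus_requested=4, min_gpus=1, attained_service=2.0),
        DLTJob("llm-sft",   gpus_requested=8, min_gpus=4, attained_service=0.5),
        DLTJob("bert-base", gpus_requested=2, min_gpus=1, attained_service=5.0),
    ]
    # llm-sft has the least attained service, so it is served first (8 GPUs);
    # resnet50 then shrinks elastically onto the 2 remaining GPUs; bert-base waits.
    print(schedule_round(queue, free_gpus=10))  # {'llm-sft': 8, 'resnet50': 2}

Real schedulers surveyed in the paper must additionally handle interference between co-located jobs, heterogeneous GPU types, and the cost of reconfiguring a running job; this sketch deliberately omits all of that.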
Source Journal

Chinese Journal of Electronics (Engineering: Electrical & Electronic)
CiteScore: 3.70
Self-citation rate: 16.70%
Annual publications: 342
Review time: 12.0 months
About the journal: CJE focuses on the emerging fields of electronics, publishing innovative and transformative research papers. Most papers published in CJE come from universities and research institutes, presenting their innovative research results. Both theoretical and practical contributions are encouraged, and original research papers reporting novel solutions to hot topics in electronics are strongly recommended.