A Survey on the Scheduling of DL and LLM Training Jobs in GPU Clusters

Authors: Tianhao Fu; Zehua Yang; Zhisheng Ye; Chenxiang Ma; Yang Han; Yingwei Luo; Xiaolin Wang; Zhenlin Wang
Journal: Chinese Journal of Electronics, vol. 34, no. 3, pp. 881-905, March 2025
DOI: 10.23919/cje.2024.00.070
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11060018
Abstract: As deep learning (DL) technology advances rapidly in areas such as computer vision, natural language processing, and, more recently, large language models (LLMs), the demand for computing resources has grown sharply. In particular, scheduling deep learning training (DLT) jobs on graphics processing unit (GPU) clusters has become crucial for utilizing computing resources effectively and accelerating model training. However, resource management and scheduling in GPU clusters face computing- and communication-related challenges, including job sharing, interference, elastic scheduling, heterogeneous resources, and fairness. This survey investigates the scheduling of DLT jobs in GPU clusters, focusing on optimizations at the job-characteristic and cluster-resource levels. We analyze the structure and training-computation characteristics of traditional DL models and LLMs, as well as their requirements for iterative computation, communication, GPU sharing, and resource elasticity. In addition, we compare the main contributions of this survey with related reviews and discuss research directions, including scheduling based on job characteristics and optimization strategies for cluster resources. This survey aims to give researchers and practitioners a comprehensive understanding of DLT job scheduling in GPU clusters and to point out directions for future research.
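To make concrete why naive queueing policies motivate the scheduling research the abstract describes, the toy sketch below (not taken from the survey; the `Job`, `Node`, and `schedule_fifo` names are hypothetical) implements plain FIFO placement by requested GPU count. It exhibits the head-of-line blocking that sharing-aware and elastic schedulers aim to avoid: a large job that cannot fit stalls smaller jobs behind it even when capacity for them exists.

```python
# Minimal illustrative sketch (assumptions, not the paper's method or API):
# a FIFO GPU-cluster scheduler that places each job on the first node with
# enough free GPUs, stopping at the first job that cannot be placed.
from dataclasses import dataclass
from collections import deque


@dataclass
class Job:
    name: str
    gpus_needed: int   # number of GPUs the DLT job requests


@dataclass
class Node:
    name: str
    free_gpus: int     # GPUs currently unallocated on this node


def schedule_fifo(queue: deque, nodes: list) -> list:
    """Place queued jobs in strict arrival order.

    Returns a list of (job_name, node_name) placements. Jobs that do not
    fit stay queued; everything behind the first unplaceable job also
    waits, which is the head-of-line blocking problem.
    """
    placements = []
    blocked = deque()
    while queue:
        job = queue.popleft()
        node = next((n for n in nodes if n.free_gpus >= job.gpus_needed), None)
        if node is None:
            # FIFO: stop at the first job that cannot run and keep order.
            blocked.append(job)
            blocked.extend(queue)
            queue.clear()
            break
        node.free_gpus -= job.gpus_needed
        placements.append((job.name, node.name))
    queue.extend(blocked)
    return placements


if __name__ == "__main__":
    nodes = [Node("node0", free_gpus=8), Node("node1", free_gpus=4)]
    jobs = deque([Job("resnet", 4), Job("llm-pretrain", 8), Job("bert-ft", 2)])
    # Only 'resnet' is placed: 'llm-pretrain' needs 8 contiguous GPUs and
    # blocks 'bert-ft', even though 2 GPUs remain free on either node.
    print(schedule_fifo(jobs, nodes))   # [('resnet', 'node0')]
```

Elastic scheduling (shrinking the large job to fit), GPU sharing (co-locating small jobs on partially used devices), and backfilling (letting `bert-ft` jump ahead) are among the families of remedies the survey compares; each trades off against interference and fairness, the other challenges listed above.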
About the journal:
CJE focuses on emerging fields of electronics and publishes innovative, transformative research papers, most of them contributed by universities and research institutes. Both theoretical and practical contributions are encouraged, and original research papers reporting novel solutions to hot topics in electronics are especially welcome.