A Survey on the Scheduling of DL and LLM Training Jobs in GPU Clusters

Authors: Tianhao Fu; Zehua Yang; Zhisheng Ye; Chenxiang Ma; Yang Han; Yingwei Luo; Xiaolin Wang; Zhenlin Wang
Journal: Chinese Journal of Electronics, vol. 34, no. 3, pp. 881-905, March 2025
DOI: 10.23919/cje.2024.00.070
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11060018
Abstract: As deep learning (DL) technology advances rapidly in areas such as computer vision, natural language processing, and, more recently, large language models (LLMs), the demand for computing resources has grown sharply. In particular, scheduling deep learning training (DLT) jobs on graphics processing unit (GPU) clusters has become crucial for utilizing computing resources effectively and accelerating model training. However, resource management and scheduling in GPU clusters face computing- and communication-related challenges, including job sharing, interference, elastic scheduling, heterogeneous resources, and fairness. This survey investigates the scheduling of DLT jobs in GPU clusters, focusing on optimizations at the job-characteristic and cluster-resource levels. We analyze the structure and training-computation characteristics of traditional DL models and LLMs, as well as their requirements for iterative computation, communication, GPU sharing, and resource elasticity. In addition, we compare the main contributions of this survey with related reviews and discuss research directions, including scheduling based on job characteristics and optimization strategies for cluster resources. This survey aims to give researchers and practitioners a comprehensive understanding of DLT job scheduling in GPU clusters and to point out directions for future research.
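To make concrete why naive queueing policies motivate the scheduling research the abstract describes, the toy sketch below (not taken from the survey; the `Job`, `Node`, and `schedule_fifo` names are hypothetical) implements plain FIFO placement by requested GPU count. It exhibits the head-of-line blocking that sharing-aware and elastic schedulers aim to avoid: a large job that cannot fit stalls smaller jobs behind it even when capacity for them exists.

```python
# Minimal illustrative sketch (assumptions, not the paper's method or API):
# a FIFO GPU-cluster scheduler that places each job on the first node with
# enough free GPUs, stopping at the first job that cannot be placed.
from dataclasses import dataclass
from collections import deque


@dataclass
class Job:
    name: str
    gpus_needed: int   # number of GPUs the DLT job requests


@dataclass
class Node:
    name: str
    free_gpus: int     # GPUs currently unallocated on this node


def schedule_fifo(queue: deque, nodes: list) -> list:
    """Place queued jobs in strict arrival order.

    Returns a list of (job_name, node_name) placements. Jobs that do not
    fit stay queued; everything behind the first unplaceable job also
    waits, which is the head-of-line blocking problem.
    """
    placements = []
    blocked = deque()
    while queue:
        job = queue.popleft()
        node = next((n for n in nodes if n.free_gpus >= job.gpus_needed), None)
        if node is None:
            # FIFO: stop at the first job that cannot run and keep order.
            blocked.append(job)
            blocked.extend(queue)
            queue.clear()
            break
        node.free_gpus -= job.gpus_needed
        placements.append((job.name, node.name))
    queue.extend(blocked)
    return placements


if __name__ == "__main__":
    nodes = [Node("node0", free_gpus=8), Node("node1", free_gpus=4)]
    jobs = deque([Job("resnet", 4), Job("llm-pretrain", 8), Job("bert-ft", 2)])
    # Only 'resnet' is placed: 'llm-pretrain' needs 8 contiguous GPUs and
    # blocks 'bert-ft', even though 2 GPUs remain free on either node.
    print(schedule_fifo(jobs, nodes))   # [('resnet', 'node0')]
```

Elastic scheduling (shrinking the large job to fit), GPU sharing (co-locating small jobs on partially used devices), and backfilling (letting `bert-ft` jump ahead) are among the families of remedies the survey compares; each trades off against interference and fairness, the other challenges listed above.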
About the journal:
CJE focuses on emerging fields of electronics and publishes innovative, transformative research papers, most of them contributed by universities and research institutes. Both theoretical and practical contributions are encouraged, and original research papers reporting novel solutions to hot topics in electronics are especially welcome.