Title: GAS: GPU Allocation Strategy for Deep Learning Training Tasks
Authors: Yingwen Chen, Jianchen Han, Huan Zhou, Chen Chen
DOI: 10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00133
URL: https://doi.org/10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00133
Journal: Scalable Computing-Practice and Experience, 16 1, pp. 880-887
Publication date: 2022-12-01
Publication type: Journal Article
Impact factor: 0.9 (JCR Q4, Computer Science, Software Engineering)
Citation count: 0
Abstract
With the significant increase in deep learning training (DLT) workloads on GPU clusters, the number and scale of GPU clusters are growing rapidly. A crucial question is how to schedule DLT tasks efficiently with limited cluster resources. Existing GPU schedulers do not fully consider the connection between users and clusters, and few methods optimize the GPU allocation of DLT tasks. In this study, we propose a scheduling framework for GPU clusters that improves performance and reduces the energy consumption of clusters. We first analyze how the performance and energy consumption characteristics of DLT tasks relate to their task configurations. Then, we propose a method to predict the completion time and energy consumption of DLT tasks. To make better use of cluster resources, we build on the prediction model to propose GAS, a GPU Allocation Strategy that specifies the parallelism of each DLT task. Compared to FIFO and SJF schedulers, GAS reduces the makespan by 19.6%-19.8%, the average queueing time by 84.4%-93.9%, and the energy consumption by 22.2%-22.5%. GAS also reduces the cost to users by 21.3%-21.6%. A large-scale simulation experiment further illustrates the effectiveness and scalability of GAS.
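To make the two baselines and two of the reported metrics concrete, the following is a minimal sketch of FIFO and SJF (shortest job first) ordering, with makespan and average queueing time computed for each. It is not the GAS algorithm or the authors' simulator. It assumes that all tasks arrive at time zero, that each task uses exactly one GPU, and that task durations are known in advance; the function names `simulate`, `fifo`, and `sjf` are hypothetical.

```python
import heapq
from typing import List, Tuple


def simulate(durations: List[float], num_gpus: int) -> Tuple[float, float]:
    """Run tasks in the given order, one GPU per task (illustrative assumption).

    All tasks are assumed to arrive at time 0, so a task's queueing time
    equals its start time. Returns (makespan, average queueing time).
    """
    # Min-heap of the times at which each GPU next becomes free.
    free_at = [0.0] * num_gpus
    heapq.heapify(free_at)

    total_wait = 0.0
    makespan = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available GPU
        finish = start + duration
        total_wait += start              # waited from time 0 until start
        makespan = max(makespan, finish)
        heapq.heappush(free_at, finish)
    return makespan, total_wait / len(durations)


def fifo(durations: List[float], num_gpus: int) -> Tuple[float, float]:
    """First in, first out: keep the submission order."""
    return simulate(list(durations), num_gpus)


def sjf(durations: List[float], num_gpus: int) -> Tuple[float, float]:
    """Shortest job first: run the shortest tasks first."""
    return simulate(sorted(durations), num_gpus)


if __name__ == "__main__":
    tasks = [8.0, 2.0, 4.0, 6.0]  # hypothetical task durations
    print("FIFO:", fifo(tasks, num_gpus=2))
    print("SJF: ", sjf(tasks, num_gpus=2))
```

On this toy input both orderings finish at the same time, but SJF gives a lower average queueing time. A strategy such as GAS additionally chooses how many GPUs each task receives, which this sketch does not model.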
About the journal:
The area of scalable computing has matured and reached a point where new issues and trends require a professional forum. SCPE will provide this avenue by publishing original refereed papers that address the present as well as the future of parallel and distributed computing. The journal will focus on algorithm development, implementation and execution on real-world parallel architectures, and application of parallel and distributed computing to the solution of real-life problems.