GAS: GPU Allocation Strategy for Deep Learning Training Tasks

IF 0.9 | Q4 | COMPUTER SCIENCE, SOFTWARE ENGINEERING
Yingwen Chen, Jianchen Han, Huan Zhou, Chen Chen
{"title":"GAS: GPU Allocation Strategy for Deep Learning Training Tasks","authors":"Yingwen Chen, Jianchen Han, Huan Zhou, Chen Chen","doi":"10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00133","DOIUrl":null,"url":null,"abstract":"Nowadays, with the significant increasement of the deep learning training (DLT) task workload in GPU clusters, the number and the scale of GPU clusters grow rapidly. A crucial question is how to efficiently schedule DLT tasks with limited cluster resources. Existing GPU schedulers do not fully consider the connection between users and clusters, and few methods optimize the GPU allocation of DLT tasks. In this study, we propose a scheduling framework for GPU clusters, which improves performance and reduces energy consumption of clusters. We first analyze the relationship between the characteristics of performance and energy consumption and the task configurations for DLT tasks. Then, we propose a prediction method to predict the completion time and energy consumption of DLT tasks. To make better use of cluster resources, based on the prediction model, we propose GAS, which adopts the GPU Allocation Strategy by specifying the parallelism for DLT tasks. Compared to FIFO and SJF schedulers, GAS reduces the makespan by 19.6%-19.8%, reduces the average queueing time by 84.4%-93.9% and reduces the energy consumption by 22.2%22.5%. For users, GAS also reduces the cost of users by 21.3%21.6%. The large-scale simulation experiment further illustrates the effectiveness and scalability of GAS.","PeriodicalId":43791,"journal":{"name":"Scalable Computing-Practice and Experience","volume":"16 1","pages":"880-887"},"PeriodicalIF":0.9000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scalable Computing-Practice and Experience","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00133","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

Nowadays, with the significant increase in deep learning training (DLT) workloads, GPU clusters are growing rapidly in both number and scale. A crucial question is how to schedule DLT tasks efficiently with limited cluster resources. Existing GPU schedulers do not fully consider the connection between users and clusters, and few methods optimize the GPU allocation of DLT tasks. In this study, we propose a scheduling framework for GPU clusters that improves performance and reduces the energy consumption of clusters. We first analyze how the performance and energy-consumption characteristics of DLT tasks relate to their task configurations. We then propose a method to predict the completion time and energy consumption of DLT tasks. To make better use of cluster resources, we build on this prediction model to propose GAS, a GPU Allocation Strategy that specifies the parallelism for each DLT task. Compared to FIFO and SJF schedulers, GAS reduces the makespan by 19.6%-19.8%, the average queueing time by 84.4%-93.9%, and the energy consumption by 22.2%-22.5%. GAS also reduces user cost by 21.3%-21.6%. A large-scale simulation experiment further illustrates the effectiveness and scalability of GAS.
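The abstract outlines GAS's core mechanism: predict each task's completion time and energy consumption from its configuration, then choose the degree of GPU parallelism that best trades the two off. Below is a minimal sketch of that idea in Python. The predictor forms, the power constant, the candidate GPU counts, and the weighted cost function are all illustrative assumptions; the abstract does not specify the paper's actual prediction models or scheduling policy.

```python
# Illustrative sketch of a GAS-style parallelism choice; the predictors and
# cost weighting below are assumptions, not the paper's actual models.
from dataclasses import dataclass


@dataclass
class DLTTask:
    name: str
    batch_size: int
    epochs: int


def predict_completion_time(task: DLTTask, num_gpus: int) -> float:
    """Hypothetical predictor; the paper learns this from task configurations."""
    serial = task.epochs * task.batch_size * 0.01        # stand-in base cost
    efficiency = 1.0 / (1.0 + 0.1 * (num_gpus - 1))      # diminishing returns
    return serial / (num_gpus * efficiency)


def predict_energy(task: DLTTask, num_gpus: int) -> float:
    """Hypothetical predictor: assumed constant power draw per GPU."""
    POWER_PER_GPU = 250.0                                # watts, assumed
    return POWER_PER_GPU * num_gpus * predict_completion_time(task, num_gpus)


def choose_parallelism(task: DLTTask, free_gpus: int, alpha: float = 0.5) -> int:
    """Pick the GPU count minimizing a normalized, weighted time/energy cost."""
    t1 = predict_completion_time(task, 1)
    e1 = predict_energy(task, 1)

    def cost(g: int) -> float:
        t = predict_completion_time(task, g) / t1        # normalized time
        e = predict_energy(task, g) / e1                 # normalized energy
        return alpha * t + (1.0 - alpha) * e

    candidates = [g for g in (1, 2, 4, 8) if g <= free_gpus]
    return min(candidates, key=cost)


if __name__ == "__main__":
    task = DLTTask("resnet50", batch_size=256, epochs=90)
    print("chosen parallelism:", choose_parallelism(task, free_gpus=8))
```

Normalizing both predicted quantities to their single-GPU values keeps the weighted cost scale-free, so neither objective dominates by virtue of its units alone. The real GAS policy presumably also accounts for queueing and cluster-wide state, which this per-task sketch omits.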
Source Journal

Scalable Computing-Practice and Experience (COMPUTER SCIENCE, SOFTWARE ENGINEERING)
CiteScore: 2.00
Self-citation rate: 0.00%
Articles published: 10

Journal introduction: The area of scalable computing has matured and reached a point where new issues and trends require a professional forum. SCPE will provide this avenue by publishing original refereed papers that address the present as well as the future of parallel and distributed computing. The journal will focus on algorithm development, implementation and execution on real-world parallel architectures, and application of parallel and distributed computing to the solution of real-life problems.