{"title":"TIAS: Two-level Information-Agnostic Job Scheduling in GPU Clusters","authors":"Kun Yang, Jieyu Lin, Wei Ni, Lianghua Song","doi":"10.1109/INSAI54028.2021.00041","DOIUrl":null,"url":null,"abstract":"In recent years, deep learning algorithms have shown a trend towards larger models and larger datasets. Centralized training is unable keep up with the training requirements due to limited storage and computing resources, thus distributed learning is becoming an important area of research for improving learning efficiency. There are many studies on using the features of deep learning workload to design a central scheduler for production clusters.While existing work has been focusing on overall completion time and resource efficiency, little attention has been paid to the execution deadlines. To achieve a balance between the goals of deadline and non-deadline jobs, we design a Two-level Information-Agnostic Scheduling strategy(TIAS), which can schedule the two kinds of jobs together without knowing jobs’ training duration. In the first level, we use different priority calculation methods for the two kinds of jobs; in the second level, we design a new indicator \"queue urgency\" based on three observations to sort deadline jobs within the same queue. 
Experiments on a trace-driven simulator prove that TIAS can achieve the best trade-off between deadline miss rate and non-deadline jobs’ average job completion time(JCT) compared to existing solutions.","PeriodicalId":232335,"journal":{"name":"2021 International Conference on Networking Systems of AI (INSAI)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Networking Systems of AI (INSAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INSAI54028.2021.00041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
In recent years, deep learning algorithms have trended towards larger models and larger datasets. Centralized training cannot keep up with these training requirements because of limited storage and computing resources, so distributed learning is becoming an important area of research for improving learning efficiency. Many studies use the features of deep learning workloads to design a central scheduler for production clusters. While existing work has focused on overall completion time and resource efficiency, little attention has been paid to execution deadlines. To balance the goals of deadline and non-deadline jobs, we design a Two-level Information-Agnostic Scheduling strategy (TIAS), which can schedule the two kinds of jobs together without knowing the jobs’ training durations. In the first level, we use different priority calculation methods for the two kinds of jobs; in the second level, we design a new indicator, "queue urgency", based on three observations, to sort deadline jobs within the same queue. Experiments on a trace-driven simulator show that TIAS achieves the best trade-off between deadline miss rate and non-deadline jobs’ average job completion time (JCT) compared to existing solutions.
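The two evaluation metrics named in the abstract can be made concrete with a small sketch. The abstract does not give the simulator's log format or the exact metric definitions, so everything here is an assumption for illustration: the `JobRecord` type, its field names, and the convention that a job misses its deadline when it finishes strictly after it. This is not the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class JobRecord:
    """One finished job from a hypothetical simulator log (times in seconds)."""
    submit_time: float
    finish_time: float
    # Absolute deadline; None marks a non-deadline job (assumed convention).
    deadline: Optional[float] = None


def deadline_miss_rate(jobs: Sequence[JobRecord]) -> float:
    """Fraction of deadline jobs that finished after their deadline."""
    deadline_jobs = [j for j in jobs if j.deadline is not None]
    if not deadline_jobs:
        return 0.0
    missed = sum(1 for j in deadline_jobs if j.finish_time > j.deadline)
    return missed / len(deadline_jobs)


def average_jct(jobs: Sequence[JobRecord]) -> float:
    """Mean job completion time (finish - submit) over non-deadline jobs."""
    non_deadline = [j for j in jobs if j.deadline is None]
    if not non_deadline:
        return 0.0
    total = sum(j.finish_time - j.submit_time for j in non_deadline)
    return total / len(non_deadline)


# Made-up example log: two deadline jobs and two non-deadline jobs.
example_jobs = [
    JobRecord(submit_time=0.0, finish_time=10.0, deadline=12.0),  # met
    JobRecord(submit_time=0.0, finish_time=20.0, deadline=15.0),  # missed
    JobRecord(submit_time=5.0, finish_time=25.0),                 # JCT 20
    JobRecord(submit_time=10.0, finish_time=20.0),                # JCT 10
]
```

A scheduler that trades off the two goals would aim to keep `deadline_miss_rate(example_jobs)` low without inflating `average_jct(example_jobs)`; the two functions deliberately look at disjoint sets of jobs, matching the abstract's split between deadline and non-deadline jobs.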