Baolin Li, Tirthak Patel, V. Gadepally, K. Gettings, S. Samsi, Devesh Tiwari
Published in: 2022 IEEE High Performance Extreme Computing Conference (HPEC), 2022-09-19. DOI: https://doi.org/10.1109/HPEC55821.2022.9926390
DASH: Scheduling Deep Learning Workloads on Multi-Generational GPU-Accelerated Clusters
Two notable characteristics of modern GPU-accelerated HPC clusters are: (1) they increasingly run deep learning (DL) model-training workloads, and (2) they consist of multiple generations of GPUs, i.e., they are heterogeneous. However, existing work on GPU cluster scheduling for DL workloads has not addressed the GPU multi-generation problem. We propose DASH, a GPU cluster scheduler designed to optimally match different DL workloads to GPU types in a multi-generational GPU environment. By leveraging the execution characteristics of co-scheduled DL workloads, DASH can improve the average job runtime by 17% and the average job completion time by 14% compared to a traditional heterogeneity-unaware job scheduler.