{"title":"Adaptive and Efficient GPU Time Sharing for Hyperparameter Tuning in Cloud","authors":"L. Liu, Jian Yu, Zhijun Ding","doi":"10.1145/3545008.3545027","journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"138 1","pages":"0"},"publicationDate":"2022-08-29","publicationTypes":"Journal Article","isOpenAccess":false,"citationCount":"2","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1145/3545008.3545027"}
Adaptive and Efficient GPU Time Sharing for Hyperparameter Tuning in Cloud
Hyperparameter tuning (HPT), which selects an optimal set of hyperparameters for a learning algorithm, is critical to machine learning training. Unfortunately, current resource provisioning approaches for HPT cannot adapt resources at runtime to the upward trend of HPT accuracy, which results in low GPU utilization or low HPT accuracy. Dynamic resource provisioning approaches based on checkpointing, on the other hand, are inefficient for HPT because of the high overhead of context switching and job restarting. This paper presents DISC, an adaptive and efficient HPT service with GPU time sharing for the cloud, which aims to improve both GPU utilization and HPT accuracy. DISC provides potential-aware adaptive GPU scaling, which adjusts the size of the GPU time slices occupied by HPT jobs at runtime based on the upward trend of their accuracy. The dynamic allocation of GPU time slices is formalized as an optimization problem and tackled with an effective heuristic algorithm. Further, DISC shares GPU memory both temporally and spatially according to the memory usage pattern of HPT jobs. It introduces a time slice early release mechanism with relaxed PACK scheduling to improve memory utilization while avoiding the GPU memory overflow that time sharing can cause. DISC is implemented on top of the Kubeflow and Kubernetes ecosystem. We evaluate it using a subset of the Microsoft Philly trace together with public datasets. Experimental results show that DISC improves the average job completion time by 1.15x compared to the naïve approach and the HPT accuracy by 1.58x compared to a state-of-the-art early-stopping approach.
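
The abstract does not spell out DISC's optimization problem or its heuristic, so the following is only an illustrative sketch of what potential-aware allocation of GPU time slices could look like. It assumes that a job's "potential" is its recent average accuracy gain per step, and that spare time slices are divided in proportion to that potential after every job receives a minimum share. The names `estimate_potential`, `allocate_slices`, `window`, and `min_slices` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def estimate_potential(accuracy_history: List[float], window: int = 3) -> float:
    """Assumed potential metric: mean accuracy gain per step over the
    last `window` steps, floored at zero for flat or declining jobs."""
    recent = accuracy_history[-(window + 1):]
    if len(recent) < 2:
        return 0.0
    gain = (recent[-1] - recent[0]) / (len(recent) - 1)
    return max(gain, 0.0)


def allocate_slices(
    histories: Dict[str, List[float]],
    total_slices: int,
    min_slices: int = 1,
) -> Dict[str, int]:
    """Give every job `min_slices`, then split the spare slices in
    proportion to each job's estimated potential (hypothetical policy)."""
    jobs = list(histories)
    if total_slices < min_slices * len(jobs):
        raise ValueError("not enough time slices for the minimum share")

    potentials = {j: estimate_potential(histories[j]) for j in jobs}
    alloc = {j: min_slices for j in jobs}
    spare = total_slices - min_slices * len(jobs)
    total_potential = sum(potentials.values())

    if total_potential == 0:
        # No job is improving: hand out spare slices round-robin.
        for i in range(spare):
            alloc[jobs[i % len(jobs)]] += 1
        return alloc

    # Largest-remainder rounding so the integer shares sum to total_slices.
    shares = {j: spare * potentials[j] / total_potential for j in jobs}
    for j in jobs:
        alloc[j] += int(shares[j])
    leftover = total_slices - sum(alloc.values())
    by_remainder = sorted(
        jobs, key=lambda j: shares[j] - int(shares[j]), reverse=True
    )
    for j in by_remainder[:leftover]:
        alloc[j] += 1
    return alloc


if __name__ == "__main__":
    histories = {
        "job-a": [0.50, 0.60, 0.70],  # improving quickly
        "job-b": [0.80, 0.80, 0.80],  # plateaued
        "job-c": [0.20, 0.25, 0.30],  # improving slowly
    }
    print(allocate_slices(histories, total_slices=10))
```

In this sketch a plateaued job keeps only its minimum share rather than being stopped, which loosely mirrors the abstract's contrast between adaptive scaling and early stopping. A real scheduler would also have to respect GPU memory limits when packing jobs onto the same device, which this example does not model.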