Adaptive and Efficient GPU Time Sharing for Hyperparameter Tuning in Cloud

L. Liu, Jian Yu, Zhijun Ding
{"title":"Adaptive and Efficient GPU Time Sharing for Hyperparameter Tuning in Cloud","authors":"L. Liu, Jian Yu, Zhijun Ding","doi":"10.1145/3545008.3545027","DOIUrl":null,"url":null,"abstract":"Hyperparameter tuning (HPT), which chooses a set of optimal hyperparameters for a learning algorithm, is critical to machine learning training. Unfortunately, the current resource provisioning approaches for HPT are unable to adjust resources adaptively according to the upward trends of HPT accuracy at runtime, resulting in low GPU utilization or HPT accuracy. On the other hand, dynamic resource provisioning approaches based on checkpointing are inefficient for HPT, because of high overhead of context switching and job restarting. This paper presents DISC, an adaptive and efficient HPT service with GPU time sharing for the cloud, which aims to improve GPU utilization and HPT accuracy. DISC provides a potential-aware GPU adaptive scaling to adjust the size of GPU time slices occupied by HPT jobs at runtime based on the upward trends of HPT accuracy. The dynamic allocation of GPU time slices is formalized as an optimization problem and tackled with an effective heuristic algorithm. Further, DISC achieves GPU memory temporal and spatial sharing according to the memory usage pattern of HPT jobs. It designs a time slice early release mechanism with relaxed PACK scheduling to improve memory utilization while avoiding memory overflow of the GPU due to time sharing. DISC is implemented upon the Kubeflow and Kubernetes ecosystem. We adopt a subset of Microsoft Philly Trace with public datasets to conduct evaluation. Experimental results show that DISC improves the average job completion time by 1.15x compared to the naïve approach and the HPT accuracy by 1.58x compared to a state-of-the-art early-stopping approach.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"138 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Hyperparameter tuning (HPT), which chooses a set of optimal hyperparameters for a learning algorithm, is critical to machine learning training. Unfortunately, current resource provisioning approaches for HPT cannot adjust resources adaptively at runtime according to the upward trends of HPT accuracy, resulting in low GPU utilization or low HPT accuracy. On the other hand, dynamic resource provisioning approaches based on checkpointing are inefficient for HPT because of the high overhead of context switching and job restarting. This paper presents DISC, an adaptive and efficient HPT service with GPU time sharing for the cloud, which aims to improve both GPU utilization and HPT accuracy. DISC provides potential-aware adaptive GPU scaling, which adjusts the size of the GPU time slices occupied by HPT jobs at runtime based on the upward trends of HPT accuracy. The dynamic allocation of GPU time slices is formalized as an optimization problem and tackled with an effective heuristic algorithm. Further, DISC achieves temporal and spatial sharing of GPU memory according to the memory usage patterns of HPT jobs. It designs a time-slice early-release mechanism with relaxed PACK scheduling that improves memory utilization while avoiding GPU memory overflow caused by time sharing. DISC is implemented on the Kubeflow and Kubernetes ecosystem. We use a subset of the Microsoft Philly Trace together with public datasets for evaluation. Experimental results show that DISC improves the average job completion time by 1.15x compared to a naïve approach and HPT accuracy by 1.58x compared to a state-of-the-art early-stopping approach.
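The abstract does not spell out the heuristic itself. As a rough illustration of the idea, the minimal sketch below splits a fixed budget of GPU time slices across HPT trials in proportion to each trial's recent accuracy improvement ("potential"). Everything here is an assumption for illustration: the names (`HPTJob`, `allocate_slices`), the windowed-slope scoring rule, and the proportional split are not DISC's actual algorithm.

```python
# Hedged sketch of potential-aware GPU time-slice allocation, in the spirit
# of DISC's description. The scoring rule and greedy split are assumptions,
# not the paper's algorithm.
from dataclasses import dataclass, field

@dataclass
class HPTJob:
    name: str
    accuracy_history: list[float] = field(default_factory=list)

    def potential(self, window: int = 3) -> float:
        """Recent upward trend of validation accuracy (larger = improving faster)."""
        h = self.accuracy_history
        if len(h) < 2:
            return 1.0  # unknown potential: give a new trial a fair default share
        recent = h[-window:]
        return max(recent[-1] - recent[0], 0.0)

def allocate_slices(jobs: list[HPTJob], total_slices: int,
                    min_slices: int = 1) -> dict[str, int]:
    """Greedy heuristic: every job keeps a minimum share so no trial starves;
    the remaining slices are split in proportion to each job's potential."""
    alloc = {j.name: min_slices for j in jobs}
    spare = total_slices - min_slices * len(jobs)
    total_potential = sum(j.potential() for j in jobs) or 1.0
    for j in jobs:
        # Rounding may leave a few slices unassigned; a real scheduler
        # would redistribute them, this sketch simply keeps them idle.
        alloc[j.name] += int(spare * j.potential() / total_potential)
    return alloc

jobs = [
    HPTJob("trial-a", [0.61, 0.70, 0.78]),  # improving quickly -> larger share
    HPTJob("trial-b", [0.74, 0.75, 0.75]),  # plateaued -> minimum share
    HPTJob("trial-c", []),                  # just started -> default share
]
print(allocate_slices(jobs, total_slices=12))
```

Re-running such an allocator after each validation round would shrink the slices of plateaued trials and grow those of improving ones, which matches the abstract's stated goal of adapting to the upward trends of HPT accuracy at runtime.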