Scheduling ML training on unreliable spot instances

Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing Companion. Pub Date: 2021-12-06. DOI: 10.1145/3492323.3495594

Sheng Yang, S. Khuller, Sunav Choudhary, S. Mitra, K. Mahadik


Citations: 4

Abstract

Cloud providers rent out surplus computational resources as spot instances at a deep discount. However, these cheap spot instances are revocable: when demand surges for higher-priced on-demand instances, cloud providers can interrupt spot instances after a brief alert. Such unreliability makes it challenging to use spot instances for many long-running jobs. With checkpointing and restoration, however, machine-learning (ML) training jobs are good candidates for overcoming this difficulty. In this paper, we formalize the problem of scheduling ML training jobs on transient spot instances from the viewpoint of an ML researcher who may have a grant or credit for renting cloud computing services for several ML training tasks. Such a researcher needs to partition the computational resources wisely to maximize the outcome (the total expected utility of all jobs) while maintaining some fairness between jobs. We investigate the trade-off between low-cost, interruptible computation and high-cost, uninterruptible computation by proposing a polynomial-time algorithm based on linear-programming (LP) rounding. Based on the LP solution, we also give an LP-based heuristic that performs well in practice. We implement and evaluate these algorithms and achieve the same utility with 23% to 48% of the budget needed with on-demand instances. Moreover, the total utility we obtain is close to the theoretical upper bound under various settings, indicating close-to-optimal performance.
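To make the budget trade-off concrete, the following is a minimal toy sketch and not the paper's algorithm. It assumes a deliberately simplified model that does not come from the abstract: each job needs a fixed number of compute hours and yields a fixed utility only if it finishes, and every paid spot hour is lost to revocation independently with probability `revoke_prob` and redone from the last checkpoint. The names `Job`, `expected_spot_cost`, and `lp_relax_and_round` are hypothetical. The sketch solves the LP relaxation of the resulting knapsack problem greedily (which is optimal for a fractional knapsack) and rounds by dropping the single fractional job, so the LP value serves as an upper bound on the rounded utility.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Job:
    name: str
    hours: float    # compute hours needed to finish training (assumed known)
    utility: float  # value obtained only if the job finishes (toy assumption)


def expected_spot_cost(hours: float, spot_price: float, revoke_prob: float) -> float:
    """Expected spend on spot instances under the toy model: each paid hour
    is lost with probability revoke_prob and must be redone from the last
    checkpoint, so 1 / (1 - revoke_prob) paid hours are needed per useful hour."""
    if not 0.0 <= revoke_prob < 1.0:
        raise ValueError("revoke_prob must be in [0, 1)")
    return hours * spot_price / (1.0 - revoke_prob)


def lp_relax_and_round(
    jobs: List[Job], budget: float, cost_of: Callable[[Job], float]
) -> Tuple[List[str], float, float]:
    """Greedy solution of the knapsack LP relaxation, then round down.

    Returns (names of fully funded jobs, rounded utility, LP upper bound)."""
    # Sorting by utility per unit cost solves the fractional knapsack exactly.
    order = sorted(jobs, key=lambda j: j.utility / cost_of(j), reverse=True)
    chosen: List[str] = []
    rounded_utility = 0.0
    lp_value = 0.0
    remaining = budget
    for job in order:
        cost = cost_of(job)
        if cost <= remaining:
            chosen.append(job.name)
            rounded_utility += job.utility
            lp_value += job.utility
            remaining -= cost
        else:
            # Fractional job: counts toward the LP bound only, dropped by rounding.
            lp_value += job.utility * remaining / cost
            break
    return chosen, rounded_utility, lp_value
```

Calling `lp_relax_and_round` once with a spot-based `cost_of` (for example `lambda j: expected_spot_cost(j.hours, 0.25, 0.5)`) and once with an on-demand cost (for example `lambda j: j.hours * 1.0`) shows how the same budget funds different sets of jobs under the two pricing assumptions. The prices and revocation probability here are illustrative values, not measurements from the paper.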
