Scheduling ML training on unreliable spot instances

Sheng Yang, S. Khuller, Sunav Choudhary, S. Mitra, K. Mahadik
{"title":"Scheduling ML training on unreliable spot instances","authors":"Sheng Yang, S. Khuller, Sunav Choudhary, S. Mitra, K. Mahadik","doi":"10.1145/3492323.3495594","DOIUrl":null,"url":null,"abstract":"Cloud providers rent out surplus computational resources as spot instances at a deep discount. However, these cheap spot instances are revocable. When demand surges for higher priced on-demand instances, cloud providers can interrupt these spot instances after a brief alert. Such unreliability makes it challenging to utilize spot instances for many long-running jobs. However, with checkpoints and restoration, machine-learning (ML) training jobs are a good candidate to overcome this difficulty. In this paper, we formalize the problem of scheduling ML-training jobs on transient spot instances, especially from an ML researcher's view, who may have some grant/credit for renting cloud computing services for several ML training tasks. Such a researcher would need to partition the computational resources wisely to maximize outcome (or total expected utility of all jobs) while maintaining some fairness between jobs. We investigate the trade-off between low-cost/interruptible and high-cost/uninterruptible computation, by proposing a linear-programming (LP) rounding based polynomial time algorithm. Based on the LP solution, we also give an LP-based heuristic that performs well in practice. We implement and evaluate these algorithms, and are able to achieve the same utility with 23% to 48% of the budget needed with on-demand instances. Moreover, the total utility we get is close to the theoretical upper bound under various settings, indicating close to optimal performance.","PeriodicalId":440884,"journal":{"name":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing Companion","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing Companion","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3492323.3495594","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Cloud providers rent out surplus computational resources as spot instances at a deep discount. However, these cheap spot instances are revocable: when demand surges for higher-priced on-demand instances, cloud providers can interrupt spot instances after only a brief alert. Such unreliability makes it challenging to use spot instances for long-running jobs. With checkpointing and restoration, however, machine-learning (ML) training jobs are good candidates for overcoming this difficulty. In this paper, we formalize the problem of scheduling ML-training jobs on transient spot instances, particularly from the viewpoint of an ML researcher who has a fixed grant or credit for renting cloud computing services for several training tasks. Such a researcher needs to partition the computational resources wisely to maximize the total expected utility of all jobs while maintaining some fairness between jobs. We investigate the trade-off between low-cost/interruptible and high-cost/uninterruptible computation by proposing a polynomial-time algorithm based on linear-programming (LP) rounding. Building on the LP solution, we also give an LP-based heuristic that performs well in practice. We implement and evaluate these algorithms, achieving the same utility with 23% to 48% of the budget needed when using only on-demand instances. Moreover, the total utility obtained is close to the theoretical upper bound under various settings, indicating near-optimal performance.
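The abstract only summarizes the LP-based approach; as a rough illustration of the spot versus on-demand trade-off it describes, the sketch below sets up a toy fractional LP that splits a fixed credit between cheap-but-interruptible spot hours and expensive-but-reliable on-demand hours so as to maximize a linearized total utility. All prices, utility values, the spot-efficiency discount, and the use of scipy.optimize.linprog are illustrative assumptions, not the paper's actual formulation (which further involves rounding and fairness considerations).

```python
# Toy sketch (assumed numbers, not the paper's model): allocate a fixed budget
# between spot and on-demand hours across jobs to maximize linearized utility.
import numpy as np
from scipy.optimize import linprog

budget = 100.0                           # total cloud credit in $ (assumed)
spot_price, ondemand_price = 0.3, 1.0    # $/hour (assumed)
spot_efficiency = 0.8                    # expected useful fraction of a spot hour
                                         # after interruptions/checkpoint overhead
utility_per_hour = np.array([3.0, 2.0, 1.5])  # marginal utility of progress per job
hours_needed = np.array([60.0, 40.0, 80.0])   # hours required to finish each job

n = len(utility_per_hour)
# Decision variables: x = [spot hours per job, on-demand hours per job].
# linprog minimizes, so negate the utility objective.
c = -np.concatenate([utility_per_hour * spot_efficiency, utility_per_hour])

# Budget: spot_price * sum(spot) + ondemand_price * sum(on-demand) <= budget.
A_budget = np.concatenate([np.full(n, spot_price), np.full(n, ondemand_price)])
# Per-job cap: effective spot hours + on-demand hours <= hours needed.
A_cap = np.hstack([np.eye(n) * spot_efficiency, np.eye(n)])

A_ub = np.vstack([A_budget, A_cap])
b_ub = np.concatenate([[budget], hours_needed])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
spot_hours, ondemand_hours = res.x[:n], res.x[n:]
print("spot hours:     ", np.round(spot_hours, 1))
print("on-demand hours:", np.round(ondemand_hours, 1))
print("total utility:  ", round(-res.fun, 1))
```

Under these toy numbers the LP spends the budget on the jobs with the highest utility per dollar, buying spot hours first because their effective cost per unit of progress (price divided by spot efficiency) is lower than the on-demand cost.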