Scheduling ML training on unreliable spot instances
Sheng Yang, S. Khuller, Sunav Choudhary, S. Mitra, K. Mahadik
Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing Companion, December 6, 2021. DOI: 10.1145/3492323.3495594
Cloud providers rent out surplus computational resources as spot instances at a deep discount. However, these cheap spot instances are revocable: when demand surges for the higher-priced on-demand instances, cloud providers can interrupt spot instances after a brief alert. Such unreliability makes it challenging to use spot instances for long-running jobs. With checkpointing and restoration, however, machine-learning (ML) training jobs are good candidates for overcoming this difficulty. In this paper, we formalize the problem of scheduling ML training jobs on transient spot instances, particularly from the perspective of an ML researcher who has a grant or credit for renting cloud-computing services for several training tasks. Such a researcher needs to partition the computational resources wisely to maximize the outcome (the total expected utility of all jobs) while maintaining some fairness between jobs. We investigate the trade-off between low-cost interruptible and high-cost uninterruptible computation by proposing a polynomial-time algorithm based on linear-programming (LP) rounding. Building on the LP solution, we also give an LP-based heuristic that performs well in practice. We implement and evaluate these algorithms, and achieve the same utility with 23% to 48% of the budget that on-demand instances would require. Moreover, the total utility we obtain is close to the theoretical upper bound under various settings, indicating close-to-optimal performance.
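To make the budget-partitioning idea concrete, the sketch below sets up a toy fractional LP that splits a fixed cloud credit between cheap interruptible (spot) hours and reliable on-demand hours across a few jobs. All of the numbers and modeling choices here (prices, per-job demands, per-hour utilities, spot-availability caps, and the "effective fraction" discount for interruption and checkpoint overhead) are assumptions invented for illustration; this is not the formulation, rounding step, or fairness mechanism from the paper.

```python
# Illustrative only: a toy fractional LP for splitting a cloud budget between
# spot and on-demand compute across jobs. All parameters are made up.
import numpy as np
from scipy.optimize import linprog

budget = 100.0                        # total credit in dollars (assumed)
price_spot, price_od = 0.3, 1.0       # $/hour for spot vs. on-demand (assumed)
demand = np.array([40.0, 60.0, 80.0])    # useful compute hours each job needs
utility = np.array([1.0, 0.8, 0.5])      # utility per completed useful hour
eff = np.array([0.85, 0.85, 0.85])       # fraction of a spot hour that remains
                                         # useful after interruptions/checkpoints
spot_cap = np.array([30.0, 30.0, 30.0])  # spot hours available to each job

n = len(demand)
# Variables: x = [spot_hours_1..n, od_hours_1..n].
# linprog minimizes, so negate the (linear) utility objective.
c = -np.concatenate([utility * eff, utility])

# Budget: price_spot * sum(spot) + price_od * sum(od) <= budget
A_budget = np.concatenate([np.full(n, price_spot), np.full(n, price_od)])[None, :]
# Per-job demand: eff_j * spot_j + od_j <= demand_j (no credit for extra hours)
A_demand = np.hstack([np.diag(eff), np.eye(n)])

A_ub = np.vstack([A_budget, A_demand])
b_ub = np.concatenate([[budget], demand])
bounds = [(0, cap) for cap in spot_cap] + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
spot_hours, od_hours = res.x[:n], res.x[n:]
print("spot hours per job:     ", np.round(spot_hours, 1))
print("on-demand hours per job:", np.round(od_hours, 1))
print("total expected utility: ", round(-res.fun, 2))
```

With these toy numbers the solver exhausts the cheap spot capacity first and then spends the remaining budget on on-demand hours for the highest-utility jobs, which is the kind of spot/on-demand trade-off the paper studies. The paper's algorithm goes further: it rounds a fractional LP solution into an actual schedule and accounts for fairness between jobs, neither of which this sketch attempts.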