SLAQ: quality-driven scheduling for distributed machine learning

Proceedings of the 2017 Symposium on Cloud Computing. Pub Date: 2017-09-24. DOI: 10.1145/3127479.3127490

Haoyu Zhang, Logan Stafman, Andrew Or, M. Freedman

Citations: 120

Abstract

Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. We describe SLAQ, a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. When allocating cluster resources, SLAQ explores the quality-runtime trade-offs across multiple jobs to maximize system-wide quality improvement. To do so, SLAQ leverages the iterative nature of ML training algorithms, by collecting quality and resource usage information from concurrent jobs, and then generating highly-tailored quality-improvement predictions for future iterations. Experiments show that SLAQ achieves an average quality improvement of up to 73% and an average delay reduction of up to 44% on a large set of ML training jobs, compared to resource fairness schedulers.
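The allocation idea in the abstract (give resources to the jobs with the most predicted quality improvement) can be illustrated with a small greedy sketch. This is not SLAQ's actual implementation: the `Job` class, the `a / (k + 1) + b` loss curve, the linear-scaling assumption (`iters_per_core`), and the `allocate` function are all hypothetical choices made for illustration, and losses are assumed to be already normalized so they are comparable across jobs.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    """A hypothetical training job with an assumed loss curve a / (k + 1) + b."""
    name: str
    a: float             # scale of the assumed loss curve
    b: float             # loss floor the job converges to
    done: int            # iterations completed so far
    iters_per_core: int  # iterations one core completes per epoch (assumes linear scaling)

    def predicted_loss(self, cores: int) -> float:
        # Predicted loss at the end of the next scheduling epoch with `cores` cores.
        k = self.done + cores * self.iters_per_core
        return self.a / (k + 1) + self.b


def allocate(jobs, total_cores):
    """Hand out cores one at a time to the job with the largest predicted loss reduction."""
    alloc = {job.name: 0 for job in jobs}

    def gain(job, cores):
        # Marginal loss reduction from giving `job` one more core.
        return job.predicted_loss(cores) - job.predicted_loss(cores + 1)

    # Max-heap on marginal gain (heapq is a min-heap, so gains are negated).
    heap = [(-gain(job, 0), i) for i, job in enumerate(jobs)]
    heapq.heapify(heap)
    for _ in range(total_cores):
        _, i = heapq.heappop(heap)
        job = jobs[i]
        alloc[job.name] += 1
        heapq.heappush(heap, (-gain(job, alloc[job.name]), i))
    return alloc


if __name__ == "__main__":
    jobs = [
        Job("fresh", a=1.0, b=0.0, done=0, iters_per_core=10),
        Job("converged", a=1.0, b=0.0, done=1000, iters_per_core=10),
    ]
    print(allocate(jobs, total_cores=8))
```

Under these assumptions, a job that has barely started receives far more cores than one that is nearly converged, whereas a resource-fair scheduler would split the cores evenly between them. A real system would fit the loss curve from each job's observed history rather than assume a fixed form.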


