Online scheduling of heterogeneous distributed machine learning jobs

Proceedings of the Twenty-First International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing. Pub Date: 2020-10-07. DOI: 10.1145/3397166.3409128

Qin Zhang, Ruiting Zhou, Chuan Wu, Lei Jiao, Zongpeng Li


Citations: 14

Abstract

Distributed machine learning (ML) plays a key role in today's proliferation of AI services. A typical approach to distributed ML partitions the training dataset over multiple worker nodes that update model parameters in parallel, using a parameter server architecture. ML training jobs are typically resource elastic: the same job can be completed in different lengths of time under different resource configurations. A fundamental problem in a distributed ML cluster is how to exploit this demand elasticity and schedule jobs with different resource configurations, so that resource utilization is maximized and average job completion time is minimized. To address it, we propose an online scheduling algorithm that decides, for each job upon its arrival, the execution time window and the number and type of concurrent workers and parameter servers, with the goal of minimizing the weighted average completion time. Our online algorithm consists of (i) an online scheduling framework that iteratively groups unprocessed ML training jobs into a batch, and (ii) a batch scheduling algorithm that configures each ML job to maximize the total weight of scheduled jobs in the current iteration. The algorithm guarantees a good parameterized competitive ratio with polynomial time complexity. Extensive evaluations using real-world data demonstrate that it outperforms state-of-the-art schedulers in today's AI cloud systems.
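The two-part structure described in the abstract (an outer loop that batches unprocessed jobs, and an inner step that picks a configuration per job to maximize scheduled weight) can be illustrated with a toy sketch. This is not the paper's algorithm: the greedy weight-density rule, the doubling time windows, the single "workers" resource, and all names (`Config`, `Job`, `schedule_batch`, `online_schedule`) are assumptions made purely for illustration, and parameter servers and worker types are left out.

```python
from dataclasses import dataclass


@dataclass
class Config:
    workers: int   # concurrent workers used (hypothetical single resource)
    duration: int  # time slots the job needs with that many workers


@dataclass
class Job:
    name: str
    arrival: int
    weight: float
    configs: list  # alternative Config options: the job's "elasticity"


def schedule_batch(jobs, capacity, window):
    """Greedy stand-in for the batch step: choose jobs and configurations
    that fit in the window, favouring weight per unit of resource-time."""
    candidates = []
    for job in jobs:
        feasible = [c for c in job.configs
                    if c.duration <= window and c.workers <= capacity]
        if feasible:
            # Cheapest feasible configuration in worker-slots.
            best = min(feasible, key=lambda c: c.workers * c.duration)
            density = job.weight / (best.workers * best.duration)
            candidates.append((density, job, best))
    candidates.sort(key=lambda item: item[0], reverse=True)

    chosen, free = [], capacity
    for _, job, cfg in candidates:
        if cfg.workers <= free:
            chosen.append((job, cfg))
            free -= cfg.workers
    return chosen


def online_schedule(jobs, capacity, horizon):
    """Outer loop: at each iteration, batch every arrived but unprocessed
    job and hand the batch to schedule_batch. Windows double each round
    (an assumption, not taken from the paper)."""
    unseen = sorted(jobs, key=lambda j: j.arrival)
    pending, completion = [], {}
    t, window = 0, 1
    while t < horizon and (pending or unseen):
        while unseen and unseen[0].arrival <= t:
            pending.append(unseen.pop(0))
        for job, cfg in schedule_batch(pending, capacity, window):
            completion[job.name] = t + cfg.duration  # all start at time t
            pending.remove(job)
        t += window
        window *= 2
    return completion


def weighted_avg_completion(jobs, completion):
    """Sum of weight * completion time over scheduled jobs, divided by
    their total weight."""
    done = [j for j in jobs if j.name in completion]
    total_weight = sum(j.weight for j in done)
    return sum(j.weight * completion[j.name] for j in done) / total_weight


if __name__ == "__main__":
    demo = [
        Job("A", 0, 3.0, [Config(2, 1), Config(1, 2)]),
        Job("B", 0, 1.0, [Config(2, 2)]),
        Job("C", 1, 2.0, [Config(1, 1)]),
    ]
    result = online_schedule(demo, capacity=2, horizon=100)
    print(result, weighted_avg_completion(demo, result))
```

In this toy setting, a job that does not fit in the current window or the remaining capacity simply waits for a later, longer window. The competitive-ratio guarantee mentioned in the abstract depends on the paper's actual batch algorithm and does not carry over to this greedy stand-in.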
