Online scheduling of heterogeneous distributed machine learning jobs

Qin Zhang, Ruiting Zhou, Chuan Wu, Lei Jiao, Zongpeng Li
{"title":"Online scheduling of heterogeneous distributed machine learning jobs","authors":"Qin Zhang, Ruiting Zhou, Chuan Wu, Lei Jiao, Zongpeng Li","doi":"10.1145/3397166.3409128","DOIUrl":null,"url":null,"abstract":"Distributed machine learning (ML) has played a key role in today's proliferation of AI services. A typical model of distributed ML is to partition training datasets over multiple worker nodes to update model parameters in parallel, adopting a parameter server architecture. ML training jobs are typically resource elastic, completed using various time lengths with different resource configurations. A fundamental problem in a distributed ML cluster is how to explore the demand elasticity of ML jobs and schedule them with different resource configurations, such that the utilization of resources is maximized and average job completion time is minimized. To address it, we propose an online scheduling algorithm to decide the execution time window, the number and the type of concurrent workers and parameter servers for each job upon its arrival, with a goal of minimizing the weighted average completion time. Our online algorithm consists of (i) an online scheduling framework that groups unprocessed ML training jobs into a batch iteratively, and (ii) a batch scheduling algorithm that configures each ML job to maximize the total weight of scheduled jobs in the current iteration. Our online algorithm guarantees a good parameterized competitive ratio with polynomial time complexity. Extensive evaluations using real-world data demonstrate that it outperforms state-of-the-art schedulers in today's AI cloud systems.","PeriodicalId":122577,"journal":{"name":"Proceedings of the Twenty-First International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Twenty-First International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3397166.3409128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14

Abstract

Distributed machine learning (ML) has played a key role in today's proliferation of AI services. A typical distributed ML setup adopts the parameter server architecture, partitioning the training dataset over multiple worker nodes that update model parameters in parallel. ML training jobs are typically resource-elastic: a job can be completed within different time spans under different resource configurations. A fundamental problem in a distributed ML cluster is how to exploit this demand elasticity and schedule jobs under different resource configurations, such that resource utilization is maximized and average job completion time is minimized. To address this problem, we propose an online scheduling algorithm that decides, upon each job's arrival, its execution time window and the number and type of its concurrent workers and parameter servers, with the goal of minimizing the weighted average job completion time. Our online algorithm consists of (i) an online scheduling framework that iteratively groups unprocessed ML training jobs into batches, and (ii) a batch scheduling algorithm that configures each ML job so as to maximize the total weight of the jobs scheduled in the current iteration. The algorithm guarantees a good parameterized competitive ratio with polynomial time complexity. Extensive evaluations using real-world data demonstrate that it outperforms state-of-the-art schedulers in today's AI cloud systems.
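To make the two-level structure concrete: writing w_j for job j's weight and c_j for its completion time, the objective is to minimize the weighted average completion time, (Σ_j w_j·c_j) / (Σ_j w_j). The Python sketch below illustrates only the batching skeleton described in the abstract. The cluster capacity, the linear speed-up model, and the greedy admit-by-weight inner rule are illustrative assumptions standing in for the paper's actual batch scheduling algorithm; none of them come from the paper itself.

```python
import math
from dataclasses import dataclass

CLUSTER_WORKERS = 8  # assumed number of concurrent workers the cluster can host


@dataclass
class Job:
    job_id: int
    weight: float    # weight w_j in the weighted-completion-time objective
    arrival: int     # arrival time slot
    workload: float  # abstract training workload units


def schedule_batch(batch, now, capacity=CLUSTER_WORKERS):
    """Stand-in inner batch algorithm: admit jobs greedily by weight,
    split workers evenly among admitted jobs, and derive each job's
    completion time from an assumed linear speed-up model."""
    admitted = sorted(batch, key=lambda j: -j.weight)[:capacity]
    share = capacity // len(admitted)  # workers assigned to each admitted job
    return {
        j.job_id: (now, now + math.ceil(j.workload / share), share)
        for j in admitted
    }


def online_scheduler(jobs):
    """Outer framework: repeatedly collect all arrived-but-unscheduled
    jobs into a batch, hand the batch to the inner algorithm, and
    advance time past the batch so leftovers form the next batch."""
    pending = sorted(jobs, key=lambda j: j.arrival)
    completions = {}
    now = 0
    while pending:
        now = max(now, pending[0].arrival)
        batch = [j for j in pending if j.arrival <= now]
        decisions = schedule_batch(batch, now)
        completions.update(decisions)
        pending = [j for j in pending if j.job_id not in decisions]
        now = max(end for _, end, _ in decisions.values())
    return completions


if __name__ == "__main__":
    jobs = [Job(1, 3.0, 0, 16.0), Job(2, 1.0, 0, 8.0), Job(3, 2.0, 2, 4.0)]
    done = online_scheduler(jobs)
    # weighted average completion time: (sum_j w_j * c_j) / (sum_j w_j)
    wact = sum(j.weight * done[j.job_id][1] for j in jobs) / sum(j.weight for j in jobs)
    print(f"weighted average completion time: {wact:.2f}")
```

In the paper, the inner step additionally chooses among heterogeneous worker and parameter server types and a per-job execution window; the sketch uses a single homogeneous worker pool and immediate execution purely to keep the batching skeleton readable.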